Building a Sustainable Data Platform on AWS
10. Data Platform Use Cases
• Product development
• track KPI such as DAU and MAU
• A/B test for new feature, on-boarding, etc...
• ad-hoc analysis
• Provide data to applications
• realtime re-ranking news articles
• CTR prediction of Ads system
• dashboard service for media partners
11. Data & Its Numbers
• User activities
• ~100 GB per day (compressed)
• 60+ record types
• User demographics or configurations etc...
• 15M+ records
• Articles metadata
• 100K+ records per day
13. Sustainable Data Platform
• Provide a reliable and scalable "Lambda Architecture"
• Minimize both operation & running cost
• Be open to uncertain future
15. Why Sustainable?
• Do a lot with a few engineers
• no one is a full-time maintainer
• avoid wasting too much time
• Empower brilliant engineers at SmartNews
• everything should be as self-serve as possible
• don't ask for permission, beg for forgiveness
18. Design Principles
• Decoupled "Computation" and "Storage" layers
• multiple consumers can use the same data
• run consumers on Spot Instances
• prevent serious data loss with minimum effort
• Use the right tool for the job
• leverage AWS managed services wherever possible
• fill in the missing pieces with Presto & PipelineDB
19. An Example
[Diagram] Batch Layer: run multiple EMR clusters, one for each use case
• Components: Amazon S3, Amazon EMR (AMI 3.x), Amazon EMR (AMI 4.x), Hive on EMR, Spark on EMR
• General users: "We're satisfied with current version"
• Application engineer: "I wanna upgrade hive"
• Ad engineer: "I wanna combine news data with ad data"
• Data scientist: "I wanna test my algorithm with the latest spark"
[Diagram] Speed Layer: consume the same data for each use case
• Components: Kinesis Stream, Spark on EMR, AWS Lambda
• Data scientist: "I wanna consume streaming data by Spark"
• Application engineer: "I wanna add a streaming monitor by Lambda"
• AWS managed services
• Data replicated into multiple AZs
• High availability
21. Collect Events by Fluentd
• Forwarder (running on each instance)
• store JSON events to S3
• forward events to aggregators
• collect metrics and post them to Datadog
• Aggregator
• send events to Kinesis & PipelineDB
• other reporting tasks (not mentioned today)
24. Recommended Practices
• Make configuration as simple as possible
• fluentd can cover everything, but shouldn't
• keep it stateless
• Use v0.12 or later
• "Filter" : better performance
• "Label": eliminate 'output_tag' configuration
26. Archive to Amazon S3
• I have 2 recommended settings
• versioning
• lets you recover from human error
• lifecycle policy
• minimizes storage cost
Archive to IA or Glacier xx days after the creation date
Keep previous versions for xx days
These will save you in the future!!
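The two settings above can be written down as an S3 lifecycle configuration. The following is a minimal sketch, assuming the request shape used by boto3; the function name, rule ID, bucket name, and the day counts (30 and 14) are hypothetical stand-ins, since the slide leaves the values as "xx".

```python
def build_lifecycle_configuration(archive_after_days, keep_previous_days,
                                  storage_class="GLACIER"):
    """Build a lifecycle configuration dict in the shape boto3 expects.

    archive_after_days and keep_previous_days stand in for the "xx" values
    on the slide; choose them to fit your own retention needs.
    """
    return {
        "Rules": [
            {
                "ID": "archive-and-expire-old-versions",  # hypothetical rule name
                "Filter": {"Prefix": ""},                 # apply to the whole bucket
                "Status": "Enabled",
                # Move current objects to cheaper storage N days after creation.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": storage_class}
                ],
                # Keep previous (noncurrent) versions for N days, then delete them.
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": keep_previous_days
                },
            }
        ]
    }


# Illustrative values only; they are not taken from the slides.
config = build_lifecycle_configuration(archive_after_days=30, keep_previous_days=14)

# Applying it needs boto3 and AWS credentials, so it is left commented out:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_versioning(
#     Bucket="my-event-archive",  # hypothetical bucket name
#     VersioningConfiguration={"Status": "Enabled"},
# )
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-event-archive",
#     LifecycleConfiguration=config,
# )
```

Passing `storage_class="STANDARD_IA"` would target the IA class instead of Glacier.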
28. Various ETL Tasks
• Extract
• dump MySQL records with Embulk
• make files on S3 readable by Hive
• Transform
• transform text files into columnar files (RCFile, ORC)
• generate features for machine learning
• aggregate records (by country, by channel)
• Load
• load aggregated metrics into Amazon Aurora
29. Hive
• Most popular project in the Hadoop ecosystem
• famous for its lovely logo :)
• HiveQL and MapReduce
• converts SQL-like queries into MR jobs
• Have not adopted the Tez engine yet
• Amazon EMR doesn't support it yet
• limited improvement for our queries
30. How to process JSON?
A. Transform into a columnar table periodically
• requires a conversion job
• better performance
B. Use JSON-SerDe for temporary analysis
• easy way to query raw JSON text files
• requires "drop table" to change the schema
• performance is not good
31. Transform Tables
-- Make S3 files readable by Hive
ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION
  (dt='${DATE}', hh='${HOUR}');

-- Transform text files into columnar files (Flatten JSON)
INSERT OVERWRITE TABLE activities
  PARTITION (dt='${DATE}', action)
SELECT
  user_id, timestamp, os, country,
  data,
  action
FROM raw_activities
LATERAL VIEW json_tuple(
  raw_activities.json,
  'userId','timestamp','platform','country','action','data'
) a AS user_id, timestamp, os, country, action, data
WHERE dt = '${DATE}'
CLUSTER BY os, country, action, user_id
;
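To make the flattening step concrete, here is a rough Python equivalent of what `json_tuple` does to one raw event: pick a fixed set of top-level keys and rename them to column names. This is only an illustration, not part of the Hive job; the function name and the sample event are hypothetical.

```python
import json

# Source JSON key -> output column name, in the order used by json_tuple above.
KEY_TO_COLUMN = [
    ("userId", "user_id"),
    ("timestamp", "timestamp"),
    ("platform", "os"),
    ("country", "country"),
    ("action", "action"),
    ("data", "data"),
]


def flatten_activity(raw_json):
    """Extract top-level keys from one raw JSON event into flat columns."""
    event = json.loads(raw_json)
    row = {}
    for key, column in KEY_TO_COLUMN:
        value = event.get(key)  # missing keys become None (NULL in Hive)
        # Nested objects stay serialized as JSON text.
        if isinstance(value, (dict, list)):
            value = json.dumps(value, sort_keys=True)
        row[column] = value
    return row


# Hypothetical sample event, for illustration only.
sample = (
    '{"userId": 42, "timestamp": 1451606400, "platform": "ios", '
    '"country": "JP", "action": "viewArticle", "data": {"articleId": 7}}'
)
row = flatten_activity(sample)
```

One difference to keep in mind: `json_tuple` returns every field as a string, whereas this sketch keeps scalar values in their Python types.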
32. JSON-SerDe
-- Define table with SERDE
CREATE TABLE json_table (
country string,
languages array<string>,
religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Result: 10
SELECT religions['catholic'][0] FROM json_table;
33. cf. hive-ruby-scripting
-- Define your ruby (JRuby) script
SET rb.script=
require 'json'
def parse (json)
j = JSON.load(json)
j['profile']['attribute1']
end
;
-- Use the script in HQL
SELECT rb_exec('&parse', json) FROM user;
https://github.com/gree/hive-ruby-scripting
36. Minimize expenses
• Use Spot Instances wherever possible
• typically a 50-90% discount
• select an instance type with a stable price
• C3 family prices spike often :(
• Dynamic cluster resizing
• x2 capacity during the daily batch job
• 1/2 capacity around midnight
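The resizing rule can be sketched as a small function that maps the hour of day to a target node count. The slides do not say when the batch job or the quiet period runs, so the hour ranges and the base size below are hypothetical, as is the function name.

```python
def target_node_count(base_nodes, hour, batch_hours, quiet_hours):
    """Return the desired number of task nodes for the given hour (0-23)."""
    if hour in batch_hours:
        return base_nodes * 2           # x2 capacity during the daily batch job
    if hour in quiet_hours:
        return max(1, base_nodes // 2)  # 1/2 capacity around midnight
    return base_nodes


# Hypothetical schedule: quiet period 00:00-01:59, batch job 02:00-04:59.
QUIET_HOURS = range(0, 2)
BATCH_HOURS = range(2, 5)

plan = [target_node_count(10, h, BATCH_HOURS, QUIET_HOURS) for h in range(24)]
```

A scheduler could evaluate this once per hour and resize the cluster's task group to the returned value.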
39. Workflow Management
• Define dependencies
• task E is executed after finishing task C and task D
• Scheduling
• task A is kicked off after 09:00 AM
• throttle concurrent running of the same task
• Monitoring
• notification on failure
• task C must finish before 01:00 PM (SLA)
cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
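Dependency handling of this kind boils down to a topological sort. Below is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+); the rule "E after C and D" comes from the slide, while the edges for tasks A and B are assumptions added to make the graph more interesting.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it must wait for.
# "E after C and D" comes from the slide; the other edges are hypothetical.
dependencies = {
    "C": {"A"},
    "D": {"B"},
    "E": {"C", "D"},
}

# static_order() yields every task after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The relative order of independent tasks (such as A and B) is not fixed; only the prerequisite constraints are guaranteed.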
40. Airflow
• A workflow management system
• define workflows in Python
• built-in shiny UI & CLI
• pluggable architecture
http://nerds.airbnb.com/airflow/
41. Define Tasks
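# Python code. Imports match Airflow 1.x; newer releases move
# BashOperator to airflow.operators.bash
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative defaults (not shown on the slide); adjust to your environment
default_args = {'owner': 'airflow', 'start_date': datetime(2016, 1, 1)}

# DAG
dag = DAG('tutorial', default_args=default_args)

# Tasks
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
t3 = BashOperator(
    task_id='templated',
    bash_command="""
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7)}}"
        echo "{{ params.my_param }}"
    {% endfor %}
    """,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

# Dependencies: t2 and t3 both run after t1
t2.set_upstream(t1)
t3.set_upstream(t1)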
45. Alerting to Slack
• SLA Violation
• task A should be done by 00:00 PM
• another team's task K depends on task A
• Output validation failure
• stop the following tasks if the output is doubtful
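An SLA alert of this kind can be sketched in a few lines: compare a task's finish time with its deadline and, on a miss, build a message for a Slack incoming webhook. This is an illustration under assumptions, not the deck's actual implementation; the function names, timestamps, and webhook URL are placeholders, and the request is built but deliberately not sent.

```python
import json
import urllib.request
from datetime import datetime


def sla_violated(finished_at, deadline):
    """True if the task is unfinished (None) or finished after the deadline."""
    return finished_at is None or finished_at > deadline


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration.
deadline = datetime(2016, 1, 1, 13, 0)
finished_at = datetime(2016, 1, 1, 13, 20)

request = None
if sla_violated(finished_at, deadline):
    request = build_slack_request(
        "https://hooks.slack.com/services/XXX/YYY/ZZZ",  # placeholder URL
        "SLA violation: task A missed its 13:00 deadline",
    )
    # urllib.request.urlopen(request)  # uncomment to actually post the alert
```

Passing `None` as `finished_at` covers the case where the task has not finished at all when the check runs.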
46. Retry from Web UI
Once you clear the task histories, the Airflow scheduler backfills them
47. Retry from CLI
# Clear some histories from 2016-01-01
airflow clear etl_smartnews \
  --task_regex user_ \
  --downstream \
  --start_date 2016-01-01
# Backfill uncompleted tasks
airflow backfill etl_smartnews \
  --start_date 2016-01-01
54. Presto
• A distributed SQL query engine
• joins multiple data sources (Hive + MySQL)
• supports standard ANSI SQL
• designed to handle TB- to PB-scale data
cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
55. Presto Architecture
[Diagram] A client sends queries to the Presto Coordinator, which dispatches tasks to multiple Presto Workers
• Data sources: Amazon S3, Kinesis Stream (※), Amazon RDS, Amazon Aurora
• Amazon EMR (Hive Metastore): provides Hive table metadata (S3 access only)
1. Query with Standard SQL
2. Generate execution plan
3. Dispatch tasks into multiple workers
4. Scan data concurrently
5. Aggregate data without disk I/O
6. Return result to client
※ https://github.com/qubole/presto-kinesis
56. Why Presto?
• Join multiple data sources
• skip large parts of ETL process
• lets us merge Hive/MySQL/Kinesis/PipelineDB
• Low latency
• ~30s to scan billions of records in S3
• Low maintenance cost
• stateless, and easy to integrate with Auto Scaling
57. Use case: A/B Test
-- Suppose that this table exists
DESC hive.default.user_activities;
user_id bigint
action varchar
abtest array<map<varchar, bigint>>
url varchar
-- Summarize page view per A/B Test identifier
-- for comparing two algorithms v1 & v2
SELECT
dt,
t['behaviorId'],
count(*) as pv
FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%' AND action = 'viewArticle'
AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId'] ORDER BY dt
;
2015-12-01 | algorithm_v1 | 40000
2015-12-01 | algorithm_v2 | 62000
58. Use case: Troubleshoot
-- Store access logs in S3, and query them
-- Summarize access & 95pct response time by SQL
SELECT
from_unixtime(timestamp),
count(*) as access,
approx_percentile(reqtime, 0.95) as pct95_reqtime
FROM hive.default.access_log
WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx'
GROUP BY timestamp ORDER BY timestamp
;
2015-11-04 22:00:00.000 | 6377 | 0.522
2015-11-04 22:00:01.000 | 3580 | 0.422
60. Presto Covers Everything? No!
• Fixed system on Amazon Aurora (or other RDB)
• provides KPI for products & business
• requires high availability & low latency
• has no flexibility
• Ad-hoc system on Presto
• provides access to all datasets on the data platform
• requires high scalability
• has flexibility (join various data sources)
61. Why Fixed vs Ad-hoc?
• Difficulties with an ad-hoc-only solution
• difficult to prevent heavy queries
• large distinct counts exhaust computing resources
• decreases Presto maintainability
63. Chartio
• Dashboard as a Service
• helps businesses analyze and track their critical data
• one of AWS partners (※)
• Combine multiple data sources in one dashboard
• Presto, MySQL, Redshift, BigQuery, Elasticsearch ...
• can join BigQuery + MySQL internally
• Easy to use for everyone
• everyone can make their own dashboard
• write SQL directly / generate query by drag & drop
※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
64. Creating dashboard
1. Build a query (drag & drop / SQL)
2. Add steps (filter, sort, modify)
3. Select a visualization (table, graph)
66. Why Chartio?
• Chartio saves a lot of engineering resources
• before
• maintained an in-house dashboard written in Rails
• everyone got tired of maintaining it
• after
• everyone can build their own dashboard easily
• Chartio's UI is cool
• very important factor for dashboard tool
67. Missing Pieces of Chartio
• No programmable API provided
• need to edit dashboards / charts manually
• No rollback feature
• all changes are recorded, but you cannot roll back to the previous state
• workaround: clone => edit => rename
73. Use cases
• Re-rank news articles by user feedback
• track user's positive/negative signal
• consider gender, age, location, interests
• Realtime article monitoring
• detect high bounce rate (may be broken?)
• make realtime reporting dashboard for A/B test
74. Realtime Re-Ranking
ref. Integrating stream processing (Spark Streaming + Kinesis) with offline processing (Hive)
www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive
[Diagram] Components: Amazon CloudSearch, Search API, API Gateway, Kinesis Stream, Amazon S3, Amazon EMR, DynamoDB
• Data: realtime feedback, article metadata, user interests, user behaviors
• Re-rank articles
• Offline process by Hive / Spark
76. PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• join stream to normal PostgreSQL table
• Support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
77. Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
day(arrival_timestamp),
substring(url from '.*://([^/]*)') as hostname,
COUNT(DISTINCT user_id::integer)
FROM activity_stream GROUP BY day,hostname;
-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;
-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT
percentile_cont(array[0.90, 0.95, 0.99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
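To show why a distinct count can run in constant space, here is a minimal HyperLogLog sketch in Python. It illustrates the general technique only and is not PipelineDB's implementation; the class name, the SHA-1 based hash, and the default precision of 12 bits are choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: estimates distinct counts in fixed memory."""

    def __init__(self, precision=12):
        # precision should be >= 7 so the alpha formula below applies (m >= 128).
        self.p = precision
        self.m = 1 << precision            # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias-correction constant

    def add(self, item):
        # Derive a 64-bit hash from the item.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for user_id in range(10000):
    hll.add(user_id)
    hll.add(user_id)  # duplicates never change the registers
estimate = hll.count()
```

With 4,096 registers the memory footprint stays the same however many users are added, and the estimate is expected to land within a few percent of the true count.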
79. Sustainable Data Platform
• build a reliable and scalable lambda architecture
• minimize operation & running cost
• be open to uncertain future
80. My Wishlist to AWS
• Support Reduced Redundancy Storage (RRS) on EMR
• Faster EMR Launch
• Set TTL to DynamoDB records
• Auto-scale Kinesis Stream
• Launch Kinesis Analytics in Tokyo region