Apache Tajo supports OpenStack Swift as one of its data sources.
This slide was presented at OpenStack Day in Korea 2015.
Outline
● Introduction to OpenStack Swift
● Introduction to Apache Tajo
● Tajo on Swift
● Demo
● Our Roadmap
This document discusses using artificial intelligence to optimize queries in BigQuery databases. It describes the benefits and limitations of managed databases like BigQuery. It then presents alternatives like SQL Server, ElasticSearch and Athena. The document outlines best practices for partitioning, clustering and limiting queries in BigQuery. It demonstrates how an AI optimization engine could predict query costs and perform real-time optimizations to scan less data and provide query recommendations. The goal is to make BigQuery faster, smarter and more efficient.
Introduction to Neo4j (Tabriz Software Open Talks) by Farzin Bagheri
This document provides an overview of Neo4j, a graph database. It begins with definitions of relational and NoSQL databases, categorizing NoSQL into key-value, document, column-oriented, and graph databases. Graph databases are explained to contain nodes, relationships, and properties. Neo4j is introduced as an example graph database, with Cypher listed as its query language. Examples of using Cypher to create nodes and relationships are provided. Finally, potential uses of Neo4j are listed, including social networks, network analysis, recommendations, and more.
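The node/relationship/property model described above can be sketched in plain Python. This is a toy illustration of the property-graph data model only, not Neo4j's implementation; the Cypher statement in the comment shows the equivalent declarative form.

```python
# Toy property-graph model: nodes and relationships, each carrying
# key-value properties. Illustrative only -- not how Neo4j stores data.

class Node:
    def __init__(self, label, **props):
        self.label = label
        self.props = props
        self.rels = []          # outgoing relationships

class Relationship:
    def __init__(self, rel_type, start, end, **props):
        self.type = rel_type
        self.start, self.end = start, end
        self.props = props
        start.rels.append(self)

# Roughly equivalent to the Cypher statement:
#   CREATE (a:Person {name: 'Alice'})
#          -[:FRIENDS_WITH {since: 2020}]->
#          (b:Person {name: 'Bob'})
alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
Relationship("FRIENDS_WITH", alice, bob, since=2020)

# Traversal: who is Alice friends with?
friends = [r.end.props["name"] for r in alice.rels if r.type == "FRIENDS_WITH"]
print(friends)  # ['Bob']
```

The point of the graph model is that relationships are first-class data, so traversals like the last line replace the join logic a relational database would need.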
Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail... by ScyllaDB
At Kiwi.com we never stop innovating our product and our architecture. Over the past couple of years, we saw a significant rise in technology requirements both globally and internally and had already tried several database solutions. The transformation went from small applications to complex microservices architectures. We first migrated to Cassandra from a big PostgreSQL cluster to get better performance and scalability, but our demands never stopped growing. That is why we decided to go with Scylla. In this talk, I will cover how our team approached testing of Scylla, the migration plan, how it impacts our business and how it influenced our high-level architecture of the application and infrastructure. It has a significant impact on disaster recovery and availability of our overall system.
The document describes Pinterest's scaling efforts from 2010 to 2012. It started with a single server on Rackspace and grew to use Amazon Web Services with over 100 web servers and database shards on MySQL, Redis, Memcache and other technologies. Key lessons included keeping systems simple initially and that clustering is difficult due to a single point of failure in the cluster management. Pinterest transitioned to manual sharding of MySQL databases to improve scalability while avoiding the complexity of clustering.
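The manual-sharding idea summarized above can be sketched in a few lines: the application itself routes each record to a shard deterministically, so no cluster manager (and no single point of failure in one) is involved. The shard names and modulo scheme below are illustrative assumptions; Pinterest's actual design reportedly embeds the shard ID inside the object ID.

```python
# Minimal sketch of application-level (manual) sharding: the application,
# not a cluster manager, decides which MySQL shard owns a record.
# Shard names are hypothetical.

SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id: int) -> str:
    # Deterministic routing: the same user always maps to the same shard,
    # so lookups need no cluster-wide coordination.
    return SHARDS[user_id % len(SHARDS)]

print(shard_for(12345))   # mysql-shard-1
print(shard_for(12346))   # mysql-shard-2
```

The trade-off is that rebalancing (changing the number of shards) becomes a manual migration, which is exactly the simplicity-over-automation choice the talk describes.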
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale by ScyllaDB
Zeotap’s Connect product addresses the challenges of identity resolution and linking for AdTech and MarTech. Zeotap manages roughly 20 billion IDs and growing. In their presentation, Zeotap engineers will delve into data access patterns and processing and storage requirements to make a case for a graph-based store. They will share the results of PoCs run on technologies such as Dgraph, OrientDB, Aerospike, and Scylla, present the reasoning for selecting JanusGraph backed by Scylla, and take a deep dive into their data model architecture from the point of ingestion. Learn what is required for the production setup, configuration, and performance tuning to manage data at this scale.
Stitch Fix aspires to help you find the style that you will love. Data, the backbone of the business, is used to help with styling recommendations, demand modeling, user acquisition, and merchandise planning and also to influence business decisions throughout the organization. These decisions are backed by algorithms and data collected and interpreted based on client preferences. Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.
Apache Spark plays an important role in Stitch Fix’s data platform, and the company’s data scientists use Spark for their ETL and Presto for their ad hoc queries. The goal for the team running the compute infrastructure is to understand and make the data scientists’ lives easier, particularly in terms of usability of Spark, by building tools that expedite the process of getting started with Spark and transitioning from an ad hoc to a production workflow. The compute infrastructure is a part of the data platform that is responsible for all the needs of data scientists at Stitch Fix.
Neelesh shares Stitch Fix’s journey, exploring its ad hoc and production infrastructure and detailing its in-house tools and how they work in synergy with open source frameworks in a cloud environment. Neelesh also discusses the additional improvements to the infrastructure that help persist information for future use and optimization and explains how the implementation of Amazon’s EMR FS has helped make it easier to read from the S3 source.
This document discusses a command line tool called hotdog for interacting with DataDog. It summarizes that hotdog allows users to search for hosts on DataDog using tag expressions and instance IDs. It works by parsing expressions, retrieving host tag mappings from the DataDog API, building an index of host-tag relations, evaluating the expression against the index, and outputting results. The presenter then discusses how Treasure Data uses DataDog for monitoring and is hiring.
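The evaluation step described above — build a host-to-tags index, then evaluate a tag expression against it — can be sketched as follows. The host names and tags are made up, and Python's own boolean operators over set-membership tests stand in for hotdog's actual expression grammar, which the summary does not show.

```python
# Sketch of evaluating tag expressions against a host -> tags index,
# in the spirit of the hotdog tool described above. Hosts and tags are
# hypothetical examples.

host_tags = {
    "web-01": {"role:web", "env:prod"},
    "web-02": {"role:web", "env:staging"},
    "db-01":  {"role:db",  "env:prod"},
}

def search(predicate):
    """Return hosts whose tag set satisfies the predicate."""
    return sorted(h for h, tags in host_tags.items() if predicate(tags))

# Equivalent of a "role:web AND env:prod" expression:
print(search(lambda t: "role:web" in t and "env:prod" in t))  # ['web-01']
# Equivalent of "env:prod":
print(search(lambda t: "env:prod" in t))  # ['db-01', 'web-01']
```

In the real tool, the host-tag mapping comes from the DataDog API rather than a hard-coded dict, and the expression is parsed from a string on the command line.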
This document discusses using Elasticsearch as a time series database. It covers why Elasticsearch was chosen over other options for storing metrics from the open source performance monitoring tool Stagemonitor. The document discusses Elasticsearch's ability to scale, its functions and visualization support in Kibana. It also covers how Stagemonitor's data is modeled in Elasticsearch, including the use of tags, and how index management is handled through a hot/cold node architecture and tools like Curator.
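The time-based index management mentioned above can be sketched as two small functions: metrics go to one index per day, and indices older than a cutoff move from hot to cold nodes. The `metrics-` naming pattern and the 7-day cutoff are assumptions for illustration; in practice a tool like Curator automates this against Elasticsearch.

```python
# Sketch of daily-index naming and hot/cold tiering for time series data.
# Index name pattern and cutoff are assumed values, not Stagemonitor's.

from datetime import date, timedelta

def index_name(day: date) -> str:
    # One index per day keeps deletes cheap: drop whole indices, not docs.
    return f"metrics-{day:%Y.%m.%d}"

def tier(day: date, today: date, hot_days: int = 7) -> str:
    # Recent indices stay on fast "hot" nodes; older ones move to "cold".
    return "hot" if (today - day).days < hot_days else "cold"

today = date(2018, 6, 15)
print(index_name(today))                        # metrics-2018.06.15
print(tier(today - timedelta(days=3), today))   # hot
print(tier(today - timedelta(days=30), today))  # cold
```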
Vitaliy Bondarenko, "Fast Data Platform for Real-Time Analytics. Architecture ..." by Fwdays
We will start from understanding how Real-Time Analytics can be implemented on Enterprise Level Infrastructure, then go into details and discover how different cases of business intelligence can be used in real time on streaming data. We will cover different Stream Data Processing Architectures and discuss their benefits and disadvantages. I'll show with live demos how to build a Fast Data Platform in Azure Cloud using open source projects: Apache Kafka, Apache Cassandra, Mesos. I'll also show examples and code from real projects.
This document summarizes Ryu Kobayashi's presentation on HDP2 and YARN operations. The presentation introduced YARN, the resource management framework in Hadoop 2.0, describing its architecture and how it differs from the previous MapReduce v1 framework. It highlighted important considerations for YARN resource management and potential bugs in older versions of Hadoop.
When learning Apache Spark, where should a person begin? What are the key fundamentals when learning Apache Spark? Resilient Distributed Datasets, Spark Drivers and Context, Transformations, Actions.
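The fundamentals listed above — transformations are lazy, and work only happens when an action runs — can be illustrated without Spark at all. The toy class below mimics the RDD programming model in plain Python; it is not PySpark, and `ToyRDD` is an invented name.

```python
# Toy illustration of Spark's lazy-evaluation model: transformations
# (map, filter) only record work; the action (collect) executes it.

class ToyRDD:
    def __init__(self, data, ops=()):
        self._data, self._ops = data, ops

    # Transformations: return a new ToyRDD, do no work yet.
    def map(self, f):
        return ToyRDD(self._data, self._ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._data, self._ops + (("filter", f),))

    # Action: triggers evaluation of the recorded lineage.
    def collect(self):
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

# No computation happens on these two lines -- only lineage is recorded:
rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

Real RDDs add partitioning and fault tolerance (recomputing lost partitions from the lineage), but the lazy-pipeline shape is the same.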
This document discusses Apache Arrow, an open source cross-language development platform for in-memory analytics. It provides an overview of Arrow's goals of being cross-language compatible, optimized for modern CPUs, and enabling interoperability between systems. Key components include core C++/Java libraries, integrations with projects like Pandas and Spark, and common message patterns for sharing data. The document also describes how Arrow is implemented in practice in systems like Dremio's Sabot query engine.
Scylla Summit 2018: Scaling your time series data with Newts by ScyllaDB
Today's datasets are growing at an exponential rate. Collection, storage, analysis, and reporting are becoming more challenging, and the results more valued. A decade ago, RRDTool's algorithms were well-suited to our requirements, but they fall short of scaling to current demands. A new direction is needed, one that prioritizes write-optimized storage, and that scales beyond a single host.
This presentation will provide an overview of Newts, a distributed time-series data store based on ScyllaDB, show how it compares to other solutions, and take a look at how it is integrated in OpenNMS.
Data-Driven Development Era and Its Technologies by Satoshi Tagomori
This document discusses data-driven development and the technologies used in the data analytics process. It covers topics like data collection, storage, processing, and visualization. The document advocates using managed cloud services for data and analytics to focus on data instead of managing infrastructure. Choosing technologies should be based on the type of data and problems to solve, not the other way around. Services like Google BigQuery, Amazon Redshift, and Treasure Data are recommended for their ease of use.
This document provides information on and demonstrations of several bleeding edge database technologies: Aerospike, Algebraix Data, and Google BigQuery. It includes benchmark results, architecture diagrams, pricing and deployment details for each one. Example use cases and instructions for getting started with the technologies are also provided.
Short overview of data infrastructure at Bazaarvoice. We use a combination of many different data stores such as MySQL, SOLR, Infobright, MongoDB and Hadoop.
Amazon AWS Big Data Demystified | Introduction to Streaming and Messaging flu... by Omid Vahdaty
This document provides an overview of streaming data and messaging concepts including batch processing, streaming, streaming vs messaging, challenges with streaming data, and AWS services for streaming and messaging like Kinesis, Kinesis Firehose, SQS, and Kafka. It discusses use cases and comparisons for these different services. For example, Kinesis is suitable for complex analytics on streaming data while SQS focuses on per-event messaging. Firehose automatically loads streaming data into AWS services like S3 and Redshift without custom coding.
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully by Md Kamaruzzaman
In modern software development and software architecture, selecting the right DataStore is one of the most challenging and important tasks. In this presentation, I have summarized the major DataStores and the decision criteria to select the right DataStore according to the use case.
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival by ScyllaDB
GPS Insight is a leader in fleet vehicle management using IoT. Internally they use a combination of SQL and NoSQL big data technologies, including distributed SQL data analytics via Presto, an open-source query engine developed by Facebook. Learn how to set up, configure, and use Presto with Scylla for supporting ad hoc non-partition-key queries for analytics and data scientists. Plus hear how to use Presto for a data archival approach with CSV files on S3 or a similar storage appliance.
MagnetoDB: Key/Value Storage, Big Data in OpenStack by Sergey Kovalev and Ilya Sviridov (GeeksLab Odessa)
MagnetoDB is an open source implementation of the Amazon DynamoDB key-value database API for OpenStack. It provides a scalable noSQL database with a schemaless data model and predictable performance. The current version supports basic CRUD operations and data querying. Future work includes adding additional DynamoDB API features and integrating further with OpenStack services. MagnetoDB aims to allow applications using DynamoDB to run on OpenStack.
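The DynamoDB-style API shape that MagnetoDB implements — schemaless items addressed by a key, with basic CRUD operations — can be sketched as below. This models the API surface only; MagnetoDB itself is a distributed OpenStack service, not an in-memory dict, and the class and item names here are invented.

```python
# Toy sketch of a DynamoDB-style key-value CRUD API: items are schemaless
# dicts addressed by a key. Illustrative only.

class ToyKVTable:
    def __init__(self):
        self._items = {}

    def put_item(self, key, item):
        # Create or fully replace; no schema is enforced on the item.
        self._items[key] = dict(item)

    def get_item(self, key):
        return self._items.get(key)

    def delete_item(self, key):
        self._items.pop(key, None)

table = ToyKVTable()
table.put_item("user#1", {"name": "Ada", "plan": "free"})
table.put_item("user#1", {"name": "Ada", "plan": "pro"})  # replace in place
print(table.get_item("user#1"))  # {'name': 'Ada', 'plan': 'pro'}
table.delete_item("user#1")
print(table.get_item("user#1"))  # None
```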
Presto is Uber's distributed SQL query engine for their Hadoop data warehouse. Some key points:
- Presto allows interactive SQL queries directly on Uber's petabyte-scale Hadoop data lake without needing to first load the data into another database.
- It provides fast performance at scale by leveraging columnar data formats like Parquet and optimizing for distributed execution across many nodes.
- Uber deployed a 200-node Presto cluster that handles 30,000 queries per day, serving both ad hoc queries and real-time applications accessing data in Hadoop, and improving on the performance of alternative solutions like Hive.
Open source big data landscape and possible ITS applications by SoftwareMill
What is big data, and how open-source big data projects, such as Apache Spark, Kafka and Cassandra can be used in ITS (Intelligent Transport Systems) related projects.
MongoDB uses replication to provide high availability and redundancy. The document discusses MongoDB replication fundamentals including replica sets, oplogs, and reading from secondary nodes. It provides an overview of primary/secondary roles in replica sets, how writes are logged to oplogs, and how secondaries replicate by reading the primary's oplog. It also covers read preference settings and write concerns in MongoDB replication.
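The oplog-based flow summarized above can be modeled in a few lines: the primary appends every write to an operation log, and a secondary catches up by replaying the oplog entries it has not yet applied. Real MongoDB oplogs are capped collections with richer entry formats and timestamps; this toy only illustrates the mechanism, and the class names are invented.

```python
# Toy model of oplog-based replication: the primary logs each write,
# and secondaries replicate by replaying unapplied oplog entries.

class Primary:
    def __init__(self):
        self.data, self.oplog = {}, []

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append(("set", key, value))  # record the operation

class Secondary:
    def __init__(self):
        self.data, self.applied = {}, 0

    def sync(self, primary):
        # Replay only the entries we have not applied yet, in order.
        for op, key, value in primary.oplog[self.applied:]:
            if op == "set":
                self.data[key] = value
        self.applied = len(primary.oplog)

p, s = Primary(), Secondary()
p.write("a", 1)
p.write("b", 2)
s.sync(p)
print(s.data)  # {'a': 1, 'b': 2}
```

Reading from a secondary (a "read preference" choice) means reading this replayed copy, which may lag the primary between syncs — the reason write concerns exist.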
This document discusses Presto, an open source distributed SQL query engine. It is used by many large companies like Facebook, Uber, and Netflix for querying large datasets across various data sources. Presto provides high performance through its columnar processing, runtime compilation, and new cost-based optimizer. The document also describes how Presto can be run on AWS and Azure cloud platforms through partnerships with Starburst, who contributed many features to Presto and provides commercial support for enterprises.
How ReversingLabs Serves File Reputation Service for 10B Files by ScyllaDB
ReversingLabs is on a mission to deliver threat intelligence to their users by providing complete visibility and insight into every destructive object. To deliver on their commitment, they migrated to Scylla to handle thousands of updates per second in their processing engines. In their talk, they will go over their requirements and show how they tuned the system to handle requests from their API frontend.
AWS Big Data Demystified #2 | Athena, Spectrum, EMR, Hive by Omid Vahdaty
This document provides an overview of various AWS big data services including Athena, Redshift Spectrum, EMR, and Hive. It discusses how Athena allows users to run SQL queries directly on data stored in S3 using Presto. Redshift Spectrum enables querying data in S3 using standard SQL from Amazon Redshift. EMR is a managed Hadoop framework that can run Hive, Spark, and other big data applications. Hive provides a SQL-like interface to query data stored in various formats like Parquet and ORC on distributed storage systems. The document demonstrates features and provides best practices for working with these AWS big data services.
Apache Tajo on Swift: Bringing SQL to the OpenStack World by Jihoon Son
This slide was presented at the SK Telecom T Developer Forum. It contains brief evaluation results of the query execution performance of Tajo on Swift.
I conducted two kinds of experiments: the first compared the performance of Tajo on Swift with its performance on another distributed storage, i.e., HDFS; the second tested the scalability of Swift.
Interestingly, the scan performance on Swift is more than two times slower than that on HDFS. In addition, the task scheduling time on Swift is much greater than that on HDFS, which means the query initialization cost is very high.
This document discusses basic configurations in Apache Tajo 0.11, including cluster resources, concurrent disk access, and garbage collection. It recommends configuring the worker heap size, number of disks per node, minimum memory per task, number of tasks assigned per disk, and temporary directory locations. The document also notes that Tajo works well with default configurations and provides links for more information.
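The knobs listed above live in Tajo's configuration files. Below is a hedged sketch of what a `tajo-site.xml` fragment covering them might look like; the property names and values are assumptions based on Tajo 0.11-era settings and should be checked against the official configuration documentation before use.

```xml
<!-- Hypothetical tajo-site.xml fragment; property names are assumed. -->
<configuration>
  <!-- Total memory a worker may use for tasks -->
  <property>
    <name>tajo.worker.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <!-- Number of disks per node, bounding concurrent disk access -->
  <property>
    <name>tajo.worker.resource.disks</name>
    <value>2</value>
  </property>
  <!-- Minimum memory allocated per task -->
  <property>
    <name>tajo.task.resource.min.memory-mb</name>
    <value>1000</value>
  </property>
  <!-- Local temporary directory locations for intermediate data -->
  <property>
    <name>tajo.worker.tmpdir.locations</name>
    <value>/data1/tajo/tmp,/data2/tajo/tmp</value>
  </property>
</configuration>
```

As the summary notes, Tajo works with default configurations; fragments like this are only needed when tuning for specific hardware.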
The document provides an overview of serverless computing and deploying Python applications using AWS Lambda. It discusses how serverless computing removes the need to manage servers and allows scaling without capacity planning. The rest of the document demonstrates how to deploy a Python application on AWS Lambda using the Zappa framework. It shows how Zappa handles packaging code and dependencies, deployment, and management of Lambda functions and API Gateway configuration. Some potential issues with serverless like cold starts and limitations on function duration are also covered.
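The entry-point contract Lambda imposes — a handler receiving an event dict and a context object — is what Zappa wraps a whole WSGI application behind. The standalone handler below shows just that contract; the event shape and message content are illustrative assumptions.

```python
# Minimal sketch of an AWS Lambda handler in Python: Lambda invokes
# handler(event, context), where event is the triggering payload as a
# dict. The event fields used here are hypothetical.

import json

def handler(event, context):
    name = event.get("name", "world")
    # An API Gateway-style response: status code plus a JSON string body.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Local invocation with a fake event (context is unused here):
print(handler({"name": "Zappa"}, None)["body"])  # {"message": "hello, Zappa"}
```

Zappa's packaging step uploads the app and its dependencies and points Lambda at a handler like this, so the developer never writes one by hand.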
Savanna is an OpenStack component that allows elastic provisioning of Hadoop clusters in OpenStack. It has a 3 phase roadmap - phase 1 allows basic cluster provisioning which is complete, phase 2 will add advanced configuration and tool integration currently in progress, and phase 3 will enable analytics as a service with a job execution framework. Savanna uses an extensible plugin architecture to provision Hadoop VMs and configure the clusters, integrating with other OpenStack components like Nova, Glance, and Swift.
Out of the box, Accumulo's strengths are difficult to appreciate without first building an application that showcases its capabilities to handle massive amounts of data. Unfortunately, building such an application is non-trivial for many would-be users, which affects Accumulo's adoption.
In this talk, we introduce Datawave, a complete ingest, query, and analytic framework for Accumulo. Datawave, recently open-sourced by the National Security Agency, capitalizes on Accumulo's capabilities, provides an API for working with structured and unstructured data, and boasts a robust, flexible, and scalable backend.
We'll do a deep dive into Datawave's project layout, table structures, and APIs in addition to demonstrating the Datawave quickstart—a tool that makes it incredibly easy to hit the ground running with Accumulo and Datawave without having to develop a complete application.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix use the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up & keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists, etc. Filmed at qconsf.com..
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
This document discusses support technologies for development infrastructure in Vietnam. It introduces provisioning tools, virtual machines, and other serverside technologies that can help build and maintain development environments more efficiently. Specifically, it covers:
- Provisioning tools like Chef and Ansible that allow infrastructure to be coded and reproduced automatically.
- Virtual machines like Vagrant and Docker that provide isolated environments for applications without the overhead of full virtual machines.
- Other technologies like chatbots and continuous integration tools that can enhance development processes.
In conclusion, these technologies allow infrastructure maintenance to be less tiresome and costly when used to automate environment setup and testing.
This document provides an overview of serverless computing using AWS Lambda. It begins with defining serverless architecture and its benefits over traditional server-based architectures like reduced maintenance and pay-per-use model. It then demonstrates how to write and deploy Python applications on AWS Lambda using the Zappa framework, covering choosing templates and triggers, adding configuration and code, testing, and deployment. Some pitfalls of the serverless model like cold starts and limits are also discussed. Alternatives to AWS Lambda like Google Cloud Functions and Azure Functions are briefly mentioned.
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexibility and higher performance approaches to analyze their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
10,000 microservices are generated each month using JHipster!
During this in-depth session by the two JHipster lead developers, we’ll detail:
How to develop and deploy microservices easily
Scalability and failover of microservices
The JHipster Registry for scaling, configuring and monitoring microservices
Common architecture patterns and pitfalls
The document summarizes Spring Data Neo4j 4.0, a new version of the Spring Data project that provides integration with the Neo4j graph database. It describes Neo4j and Spring Data briefly, then outlines the key features and architecture of SDN 4.0, including a standalone object-graph mapping layer, variable depth persistence, and integration with Spring and repositories. It demonstrates a sample conference application built with SDN 4.0 and provides information on getting started and support resources.
This document discusses Twitter's adoption of open source technologies and how it is evolving with extending infrastructure support for the cloud. It provides details on Twitter's use of open source technologies like Apache Hadoop, Apache Spark and Apache Kafka at scale for data processing, storage and analytics. It also discusses Twitter's cloud journey, challenges in areas like metadata integration, data replication at scale, security and tooling for easy onboarding of cloud services. Lastly, it covers topics like a focus on standards, challenges of multi-cloud, and whether to choose all cloud or a hybrid approach going forward.
This document provides a summary of Netflix's architecture and use of open source software. It discusses:
- Why Netflix open sources software, including gathering feedback, collaboration, and improving retention and recruiting
- Popular Netflix open source projects like Eureka, Ribbon, and Hystrix that are widely used in cloud architectures
- Netflix's microservices architecture and emphasis on automation, high availability, and continuous delivery
- How Netflix ensures operational visibility and security at scale through open source tools like Turbine, Atlas, and Security Monkey
- Getting started resources for understanding and running Netflix's technologies like ZeroToCloud and ZeroToDocker workshops
This document provides an overview of serverless computing using AWS Lambda. It discusses what serverless means, how it addresses issues with traditional server-based architectures like capacity planning and scaling. It then covers how to build and deploy serverless Python applications using AWS Lambda, including choosing templates and triggers, adding configuration and code, testing, and writing clients. Alternatives like Google Cloud Functions and the Zappa framework for deploying serverless apps are also mentioned.
CON6423: Scalable JavaScript applications with Project NashornMichel Graciano
In the age of cloud computing and highly demanding systems, some new approaches for application architectures such as the event-driven model have been proposed and successfully implemented with Node.js. With the Nashorn JavaScript engine, it is possible to run JavaScript applications directly in the JVM, enabling access to the latest Node.js frameworks while taking advantage of the Java platform’s scalability, manageability, tools, and extensive collection of Java libraries and middleware. This session demonstrates how to use Nashorn to create highly scalable JavaScript applications leveraging the full power of the JVM by using the projects Avatar and Node.js with Avatar.js and Vert.x, highlighting their key benefits, issues, and challenges.
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Neo4j
This document discusses Apache Hop, an open source data orchestration platform. It provides an overview of Apache Hop's capabilities for managing data pipelines and workflows. Key features highlighted include its modular architecture, support for technologies like Apache Spark and Neo4j, and focus on ease of use, testing, and community development. The roadmap outlines plans to graduate to a top-level Apache project and improve cloud and mobile support.
Hadoop on OpenStack - Sahara @DevNation 2014spinningmatt
This document provides an overview of Sahara, an OpenStack project that aims to simplify managing Hadoop infrastructure and tools. Sahara allows users to create and manage Hadoop clusters through a programmatic API or web console. It uses a plugin architecture where Hadoop distribution vendors can integrate their management software. Currently there are plugins for vanilla Apache Hadoop, Hortonworks Data Platform, and Intel Distribution for Apache Hadoop. The document outlines Sahara's architecture, APIs, roadmap, and demonstrates its use through a live demo analyzing transaction data with the BigPetStore sample application on Hadoop.
Those are slides from Dev.IL meetup talk, by Or Rosenblatt & Yshay Yaacobi from Soluto RND
https://www.meetup.com/Dev-IL/events/253252917/
-------------------------
You developed a cool java infrastructure for your team.
Your team then shifts to python, so you rewrite the utility in python.
Then the team next door asks you to do the same rewrite for their node/typescript service.
You ask for a raise and write it again in typescript.
Now your colleague reads in HackerNews about the next cool trending language in the block.
Ain’t nobody got time for that!!!
Join us to hear how the powerful combination sidecar pattern and Kubernetes can help you solve this issue by allowing different services to use the same utility, regardless of stack or language.
You will become stack-free forever!
This document discusses using Azure DevOps and Snowflake to enable continuous integration and continuous deployment (CI/CD) of database changes. It covers setting up source control in a repository, implementing pull requests for code reviews, building deployment artifacts in a build pipeline, and deploying artifacts to development, test, and production environments through a release pipeline. The document also highlights key Snowflake features like zero-copy cloning that enable testing deployments before production.
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Gas agency management system project report.pdfKamal Acharya
The project entitled "Gas Agency" is done to make the manual process easier by making it a computerized system for billing and maintaining stock. The Gas Agencies get the order request through phone calls or by personal from their customers and deliver the gas cylinders to their address based on their demand and previous delivery date. This process is made computerized and the customer's name, address and stock details are stored in a database. Based on this the billing for a customer is made simple and easier, since a customer order for gas can be accepted only after completing a certain period from the previous delivery. This can be calculated and billed easily through this. There are two types of delivery like domestic purpose use delivery and commercial purpose use delivery. The bill rate and capacity differs for both. This can be easily maintained and charged accordingly.
Digital Twins Computer Networking Paper Presentation.pptxaryanpankaj78
A Digital Twin in computer networking is a virtual representation of a physical network, used to simulate, analyze, and optimize network performance and reliability. It leverages real-time data to enhance network management, predict issues, and improve decision-making processes.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
AI for Legal Research with applications, toolsmahaffeycheryld
AI applications in legal research include rapid document analysis, case law review, and statute interpretation. AI-powered tools can sift through vast legal databases to find relevant precedents and citations, enhancing research accuracy and speed. They assist in legal writing by drafting and proofreading documents. Predictive analytics help foresee case outcomes based on historical data, aiding in strategic decision-making. AI also automates routine tasks like contract review and due diligence, freeing up lawyers to focus on complex legal issues. These applications make legal research more efficient, cost-effective, and accessible.
Build the Next Generation of Apps with the Einstein 1 Platform.
Rejoignez Philippe Ozil pour une session de workshops qui vous guidera à travers les détails de la plateforme Einstein 1, l'importance des données pour la création d'applications d'intelligence artificielle et les différents outils et technologies que Salesforce propose pour vous apporter tous les bénéfices de l'IA.
Software Engineering and Project Management - Introduction, Modeling Concepts...Prakhyath Rai
Introduction, Modeling Concepts and Class Modeling: What is Object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling
as Design technique: Modeling, abstraction, The Three models. Class Modeling: Object and Class Concept, Link and associations concepts, Generalization and Inheritance, A sample class model, Navigation of class models, and UML diagrams
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data modeling Concepts, Object Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, class Based Modeling, Creating a Behavioral Model.
Software Engineering and Project Management - Introduction, Modeling Concepts...
Apache Tajo on Swift
1. Apache Tajo on Swift
Bringing SQL to the OpenStack World
Jihoon Son
Apache Tajo PMC member
2. Who am I
● Jihoon Son
○ Ph.D. candidate (Computer Science & Engineering, 2010.3 ~)
○ Apache Tajo PMC and Committer (2014.5.1 ~)
○ Mentor of Google Summer of Code (2013)
● Contacts
○ Email: jihoonson AT apache.org
○ LinkedIn: https://www.linkedin.com/in/jihoonson
4. OpenStack Swift
● Popular object storage
○ Images, videos, logs, ...
● Enterprises store objects on Swift to provide their services
○ Usually private clusters
5. SQL on Swift
● Data analysis is important to improve the quality of their services
○ SQL is one of the most powerful and popular query languages
● Many enterprise data analysis tools rely on SQL
○ OLAP, visualization, data mining, …
● Need for running SQL on Swift
6. Apache Tajo
● Scalable, efficient, and fault-tolerant data warehouse system
○ Supports standard SQL
○ Efficient batch execution and interactive ad-hoc analysis
■ Low latency and high throughput
■ No use of MapReduce
○ No single point of failure
7. Apache Tajo
● Active open source project
○ 18 committers and 16 contributors
○ Activity summary
9. Tajo on Swift
Pluggable Storage Layer
[diagram: the Tajo Master and multiple Tajo Workers access Swift through Tajo's pluggable storage layer]
10. Tajo on Swift
● No need to modify either Tajo or Swift
○ Tajo can access Swift through the hadoop-openstack library
■ But there is no need to install or run Hadoop itself
○ Just use it
[diagram: Tajo workers read from Swift over the network]
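As an illustrative sketch (the table schema, the container name logs, and the provider name mycloud are all hypothetical), querying Swift data from Tajo might look like the following, using the swift:// URI scheme exposed by the hadoop-openstack filesystem:

```sql
-- Hypothetical example: "logs" is a Swift container on a provider named "mycloud"
CREATE EXTERNAL TABLE access_log (
  ts TIMESTAMP,
  user_id TEXT,
  url TEXT
) USING TEXT WITH ('text.delimiter' = ',')
LOCATION 'swift://logs.mycloud/access/2015/';

SELECT url, count(*) AS hits
FROM access_log
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Once the Swift filesystem is configured, the table behaves like any other external table in Tajo; no Hadoop cluster is involved.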
11. Tajo on Swift
● Configuration highlights
○ Swift configuration
■ Needs Keystone authentication for the HDFS client
■ No additional configuration
○ HDFS configuration
■ Supports different cloud providers
● Key name pattern: fs.swift.service.${provider}
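As a sketch of the client-side configuration (the provider name mycloud, the Keystone URL, and all credential values below are placeholders), the hadoop-openstack properties follow the fs.swift.service.${provider} key name pattern in core-site.xml:

```xml
<!-- Illustrative core-site.xml fragment; "mycloud" and all values are placeholders -->
<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
  <property>
    <name>fs.swift.service.mycloud.auth.url</name>
    <value>http://keystone.example.com:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.mycloud.username</name>
    <value>tajo-user</value>
  </property>
  <property>
    <name>fs.swift.service.mycloud.password</name>
    <value>secret</value>
  </property>
  <property>
    <name>fs.swift.service.mycloud.tenant</name>
    <value>tajo-tenant</value>
  </property>
</configuration>
```

With this in place, a URI such as swift://container.mycloud/path resolves against the configured provider.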
13. Tajo on Swift
● Data locality problem
[diagram: workers on Node A and Node B read from storage nodes across the interconnection network, causing significant network overhead]
14. Tajo on Swift
● Data locality problem
[diagram: the same worker and storage-node topology across the interconnection network]
15. Advanced Integration
● List endpoints middleware
○ Provides the location information of objects, accounts, or containers
■ Tajo workers can directly access each object
○ Example
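To make the idea concrete, here is a minimal client-side sketch (the proxy host, account, container, object name, and the sample response below are all hypothetical) of how one could ask the list_endpoints middleware where an object's replicas live; the middleware answers a GET on /endpoints/{account}/{container}/{object} with a JSON list of storage-node URLs:

```python
import json

def endpoints_url(proxy, account, container, obj):
    """Build the list_endpoints request URL for an object.

    A GET on this URL returns a JSON list of the storage-node URLs
    holding the object's replicas.
    """
    return "%s/endpoints/%s/%s/%s" % (proxy, account, container, obj)

url = endpoints_url("http://proxy.example.com:8080",
                    "AUTH_tajo", "logs", "access/2015/part-0000")
print(url)

# A hypothetical response body: one URL per replica, pointing directly
# at the storage nodes, so a worker running on the same node can read
# the object without going through the proxy.
sample_response = '["http://10.1.1.1:6000/sda1/2/AUTH_tajo/logs/access/2015/part-0000"]'
replicas = json.loads(sample_response)
print(replicas[0])
```

This location information is what lets Tajo's scheduler place a task on (or near) the node that actually stores the object.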
17. Advanced Integration
● Location-aware computing
○ Moving the processing close to the data
■ Avoids the performance degradation caused by data transfer over the network
○ An important issue when Tajo and Swift share the same cluster