May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
Apache Sqoop 2 is the next generation of the massively successful open source tool designed to transfer data from traditional SQL databases and data warehouses into Apache Hadoop. Sqoop 2 is designed as a client-server system with a repository that stores connection and job information, and it supports secure job submission and multiple user roles. In this talk, we discuss the issues users faced with Sqoop 1, the design of Sqoop 2, and how Sqoop 2 addresses those issues.

Presenter(s): Hari Shreedharan, Software Engineer, Cloudera


© All Rights Reserved

  • A common problem that new users of Hadoop face is how to get their data into Hadoop. Using a tool called distcp, Hadoop users can efficiently copy data between clusters. This raises the question: how do I get all my data into Hadoop in the first place? One such ingest tool is Sqoop, which was created to efficiently transfer bulk data between Hadoop and external structured datastores, since databases are not directly accessible from Hadoop. The canonical use case is performing a nightly dump of all the data in a transactional relational database into Hadoop for offline analysis. The popularity of Sqoop in enterprise systems confirms that Sqoop handles bulk transfer admirably. That said, there are further data-integration use cases that Sqoop needs to address, and it must become easier to manage and operate.
  • Bulk data transfer tool. Imports/exports from/to relational databases, enterprise data warehouses, and NoSQL systems. Populates tables in HDFS, Hive, and HBase.
  • * You might ask: why Sqoop? What is the additional value? * Why not simply run mysqldump?
  • Command-line tool: a client application configured via command-line arguments (60+!), plus connector-specific extra arguments, plus Hadoop -D arguments. Pluggable, database-specific connectors. Uses local Hadoop binaries and configuration. Workflow: command input from the client, connect to the database and fetch metadata, generate code, submit the MapReduce job.
  • * The strength of Sqoop is one tool for all database systems * Heterogeneous environments * The scope of a connector is not limited to metadata operations * Going back to my previous example, Sqoop can run mysqldump for you * One drawback of this approach: too much power, not always used wisely
  • The driver option is non-intuitive: specifying it means you want the generic connector instead of the specialized built-in connector. Security concerns: credentials are openly shared, every client needs to know the database credentials, and database access can't be centrally throttled. Some connectors may support a data format that others don't (e.g., the direct MySQL connector can't support sequence files). Sqoop users should not need to concern themselves with Sqoop admin responsibilities. Couchbase does not need a table name, yet one is required, causing --table to be overloaded as a backfill or dump operation.
  • * Sqoop is a great tool that does the job very well * The challenges that both users and developers need to overcome were covered on the previous slide * Too much power means too much responsibility
  • In Sqoop 2's client-server architecture, the client interacts with the server and no longer talks directly to Hadoop or the database. Connectors and JDBC drivers are installed and configured on the server, which interfaces with the metadata repository and performs resource throttling. Serialization, format conversion, and Hive/HBase integration become uniformly available via the Sqoop framework. Sqoop 2 is a service: install once, run everywhere. Connectors are configured in one place, managed by the Admin role and run by the Operator role; likewise, JDBC drivers live in one place, and database connectivity is only needed on the server. With Sqoop 2, Hive and HBase integration happens on the server, not the client. In fact, Hive does not need to be installed on the Sqoop server at all: Sqoop submits requests to HiveServer over the wire. Exposing a REST API for operation and management helps Sqoop integrate with external systems such as Oozie, and the two become decoupled: installing a new Sqoop connector no longer requires installing it in Oozie as well. (Hive does not invoke Sqoop, while Oozie does, so the REST API benefits Oozie rather than Hive.) Which Hive/HBase server the data lands in is the responsibility of the reduce phase, which has its own configuration; since both of these systems run on Hadoop, no added security is needed beyond passing down the Kerberos principal.
  • * The cost of improving the existing Sqoop 1 would be too high, with no clear path to addressing some of the challenges
  • What happens when you submit a Sqoop job? You create a connection and submit a job. With the user making an explicit connector choice in Sqoop 2, the process is less error-prone and more predictable; users need not be aware of the functionality of every connector (e.g., Couchbase users need not care that other connectors use tables). Connectors are no longer forced to follow the JDBC model and are no longer required to use common JDBC vocabulary (URL, database, table, etc.) regardless of whether it applies. Connections become first-class objects that encompass credentials and are created once, then used many times for various import/export jobs. Metadata is divided into two groups: Connections and Jobs. Sqoop 2 saves the metadata in the server, which houses a metadata repository, so you don't have to keep retyping it, nor are you responsible for securing it.
  • * So far we've talked a lot about high-level structures, so let's take a look at the workflow of creating a job and how it differs from Sqoop 1 * Let's discuss how it will work from the user's perspective
  • No code generation and no compilation: Sqoop can run where there are no compilers, which is more secure because no generated code is executed. Previously, clients required direct access to Hive/HBase; routing everything through the Sqoop server is more secure than opening access for all clients to perform jobs. Connections are created by an administrator and used by an operator (safeguarding credential access from end users) and can be restricted in scope by operation (import/export), so operators cannot abuse credentials. Operators are allowed to create jobs that import/export specific tables. This prevents misuse and abuse.
  • Register for HBaseCon 2013. If you have questions, Jarcec will have office hours in the Cloudera booth today at 5pm, and Kate will have office hours tomorrow at 11am.
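To make the argument-heavy Sqoop 1 workflow from the notes above concrete, here is a sketch of a typical import invocation assembled as an argument list. The flags are standard Sqoop 1 options; the host, credentials, table, and paths are illustrative placeholders:

```python
# Sketch of a typical Sqoop 1 import command line. The flags shown are
# standard Sqoop 1 options; every value below is a placeholder.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # JDBC URL to the source database
    "--username", "etl_user",
    "--password", "secret",                    # credentials exposed on every client
    "--table", "orders",                       # source table to dump
    "--target-dir", "/warehouse/orders",       # destination directory in HDFS
    "--num-mappers", "4",                      # parallelism of the MapReduce job
]
print(" ".join(sqoop_import))
```

Every client machine needs this full set of arguments, the JDBC driver, and the database credentials, which is exactly the operational burden Sqoop 2 moves onto the server.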
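The Connection/Job metadata split described in the notes can be sketched as a toy model. This is a conceptual illustration of the server-side repository, not Sqoop 2's actual classes:

```python
from dataclasses import dataclass

# Conceptual model of Sqoop 2 metadata -- not the real Sqoop 2 classes.
# A Connection is created once per database (by an admin) and holds the
# credentials; a Job is created per table and merely references a Connection.

@dataclass
class Connection:
    conn_id: int
    jdbc_url: str
    username: str
    password: str  # stored server-side, never handed out to clients

@dataclass
class Job:
    name: str
    connection_id: int  # many jobs reuse the same Connection
    table: str

sales_db = Connection(1, "jdbc:mysql://dbhost/sales", "etl_user", "secret")
jobs = [Job("import-orders", sales_db.conn_id, "orders"),
        Job("import-customers", sales_db.conn_id, "customers")]
print([j.name for j in jobs])
```

The point of the split: the credentials live in one Connection object on the server, while operators only ever touch Job objects.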
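As a sketch of what "operation via a REST API" means in practice, the fragment below builds the kind of JSON document a client might POST to the Sqoop 2 server to create a job. The endpoint path and field names here are hypothetical illustrations, not the actual Sqoop 2 wire format:

```python
import json

# Hypothetical REST payload for creating an import job on the Sqoop 2 server.
# The endpoint path and field names are assumptions for illustration only.
endpoint = "/sqoop/job"  # assumed path on the Sqoop 2 server
payload = {
    "name": "nightly-orders-import",
    "connection-id": 1,           # references a Connection stored on the server
    "type": "IMPORT",
    "table": "orders",
    "output-directory": "/warehouse/orders",
}
body = json.dumps(payload)
print("POST", endpoint, body)
```

Because the request carries only a connection id, an external system such as Oozie can submit jobs without ever holding database credentials or connector binaries.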

Presentation Transcript

  • Apache Sqoop 2: The next generation of data transfer
    Hari Shreedharan, Software Engineer, Cloudera
    Contributor, Apache Sqoop; Committer/PMC Member, Apache Flume
  • What is Sqoop?
    • Apache Top-Level Project
    • SQl to hadOOP
    • Tool to transfer data from relational databases: Teradata, MySQL, PostgreSQL, Oracle, Netezza
    • To the Hadoop ecosystem: HDFS (text, sequence file), Hive, HBase, Avro
    • And vice versa
  • Why Sqoop?
    • Efficient/controlled resource utilization: concurrent connections, time of operation
    • Datatype mapping and conversion: automatic, with user override
    • Metadata propagation: Sqoop Record, Hive Metastore, Avro
  • Sqoop 1
  • Sqoop 1
    • Based on connectors, responsible for metadata lookups and data transfer
    • Majority of connectors are JDBC based
    • Non-JDBC (direct) connectors for optimized data transfer
    • Connectors responsible for all supported functionality: HBase import, Avro support, ...
  • Sqoop 1 Challenges
    • Cryptic, contextual command-line arguments
    • Security concerns
    • Type mapping is not clearly defined
    • Client needs access to Hadoop binaries/configuration and the database
    • JDBC model is enforced
  • Sqoop 1 Challenges
    • Non-uniform functionality: different connectors support different capabilities
    • Overlapped/duplicated functionality: different connectors may implement the same capabilities differently
    • High coupling with Hadoop: database vendors required to understand Hadoop idiosyncrasies in order to build connectors
  • Sqoop 2
  • Sqoop 2 – Design Goals
    • Security and separation of concerns: role-based access and use
    • Ease of extension: no low-level Hadoop knowledge needed; no functional overlap between connectors
    • Ease of use: uniform functionality; domain-specific interactions
  • Sqoop 2: Connection vs Job metadata
    There are two distinct sets of options to pass into Sqoop:
    • Job (distinct per table)
    • Connection (distinct per database)
  • Sqoop 2: Workings
    • Connectors register metadata
    • Metadata enables creation of Connections and Jobs
    • Connections and Jobs are stored in the metadata repository
    • Operators run Jobs that use appropriate Connections
    • Admins set policy for connection use
  • Sqoop 2: Security
    • Support for secure access to external systems via role-based access to connection objects
    • Administrators create/edit/delete connections
    • Operators use connections
  • Current Status: Sqoop 2
    • Primary focus of the Sqoop community
    • Current release: 1.99.2
    • Bits and docs: http://sqoop.apache.org/
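The role separation on the security slide can be sketched in a few lines. The role names and the permission table below are assumptions for illustration; Sqoop 2's real authorization model is configured on the server:

```python
# Toy illustration of role-based access to connection objects.
# Role names and the permission table are assumptions for illustration.
PERMISSIONS = {
    "admin":    {"create_connection", "edit_connection", "delete_connection"},
    "operator": {"use_connection", "run_job"},
}

def allowed(role: str, action: str) -> bool:
    """Return True if the given role may perform the action."""
    return action in PERMISSIONS.get(role, set())

print(allowed("operator", "run_job"))            # operators may run jobs
print(allowed("operator", "create_connection"))  # but may not create connections
```

Because the check runs on the server rather than on each client, credentials never leave the connection objects that administrators manage.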