On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)


How is Big Data moved around? How are you planning to move it?
This session will focus on familiar and not-so-familiar tools you can use today
for moving and integrating Big Data. It will also outline the underlying technologies
and platform (an introduction to Big Data, Hadoop, HDInsight, and related tools).

We will compare and outline options,
discuss how they can work with your existing Hadoop and Windows Azure
environment, and provide some guidance on when and how to use each of these

Published in: Technology

  1. On the move with Big Data: Hadoop, Pig, Sqoop, SSIS… Stéphane Fréchette, Thursday, February 13, 2014
  2. Who am I? My name is Stéphane Fréchette. SQL Server MVP. I'm a Database & Business Intelligence Professional and Founder | CEO of […]. I have a passion for architecting, designing, and building solutions that matter. Self-proclaimed Open Data Hacker/Advocate. I founded Gatineau Ouverte, a citizen-led initiative that aims to promote open access to the civic data of the city of Gatineau. Twitter: @sfrechette Blog: stephanefrechette.com Email: stephanefrechette@ukubu.com
  3. Session Outline • What is Big Data? • Apache Hadoop • Hadoop Ecosystem • Windows Azure HDInsight • On the move… • SSIS, Sqoop, Pig • Demos • Resources
  4. What is Big Data?
  5. Apache Hadoop • Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models • Designed to scale up from single servers to thousands of machines, each offering local computation and storage
  6. Hadoop Ecosystem • Core components: • HDFS (Hadoop Distributed File System) -> Storage • MapReduce -> Processing
  7. What is Pig? • Write complex MapReduce jobs using a simple scripting language (Pig Latin) • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs • Pig translates and compiles complex MapReduce jobs on the fly http://pig.apache.org
  8. What is Sqoop? • Command-line interface application to transfer bulk data between Hadoop and relational datastores http://sqoop.apache.org
  9. What is Hive? • A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis • Provides an SQL-like language called HiveQL to query data • Integration between Hadoop and BI and visualization tools http://hive.apache.org
  10. What is SSIS? • SQL Server Integration Services is a platform for data integration and workflow applications. A fast and flexible tool used for data extraction, transformation, and loading (ETL). • Contains a rich set of built-in tasks and transformations, plus tools for constructing packages… • Used to solve complex business problems
  11. Windows Azure HDInsight • HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud • Based on the Hortonworks Data Platform • Scalable, on-demand service
  12. Demos (let’s move some data…)
  13. Resources • Apache Projects (list with links) http://bit.ly/MfpLtE • Windows Azure HDInsight http://bit.ly/1dnlAX1 • HDInsight Tutorials and Guide http://bit.ly/LWRYol • Hortonworks Sandbox 2.0 http://bit.ly/1gkkCte • Hortonworks Tutorial Gallery http://bit.ly/1nvMAEX • Microsoft JDBC Driver 4.0 for SQL Server http://bit.ly/1kEgJ7O • Microsoft Hive ODBC Driver http://bit.ly/NFkhcH • GitHub: WindowsAzure / azure-content http://bit.ly/1hfthlF • SSIS Custom Task – Disorderly Data (Ken Ross) http://bit.ly/1nvIH2G • GitHub https://github.com/kzhen/SSISHDFS
  14. What Questions Do You Have?
  15. Thank you for attending this session
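
Slide 8 describes Sqoop as a command-line tool for bulk transfer between Hadoop and relational datastores. As a rough illustration of what such a transfer might look like, here is a minimal Python sketch that only assembles a `sqoop import` command line and does not run it. The host, database, table, user, and target directory are hypothetical placeholders and do not come from the session. The flags (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import options, and the JDBC URL is assumed to follow the format of the Microsoft JDBC Driver for SQL Server listed in the resources.

```python
import shlex


def build_sqoop_import(host, database, table, user, target_dir, mappers=1):
    """Return the argv list for a Sqoop import from SQL Server into HDFS."""
    # JDBC URL format used by the Microsoft JDBC Driver for SQL Server.
    jdbc_url = f"jdbc:sqlserver://{host}:1433;databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "-P",                           # prompt for the password at run time
        "--table", table,               # source table in the relational store
        "--target-dir", target_dir,     # destination directory in HDFS
        "--num-mappers", str(mappers),  # number of parallel map tasks
    ]


# Hypothetical values, for illustration only.
cmd = build_sqoop_import(
    host="sqlhost.example.com",
    database="SalesDB",
    table="Orders",
    user="etl_user",
    target_dir="/data/orders",
    mappers=4,
)

# shlex.join quotes the JDBC URL, whose ';' would otherwise end the shell command.
print(shlex.join(cmd))
```

On a machine with Sqoop installed, passing the list to `subprocess.run(cmd)` would launch the import. Moving data in the other direction, from HDFS back to the relational store, uses `sqoop export` with `--export-dir` in place of `--target-dir`.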