On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

How is Big Data moved around, and how are you planning to move it?
This session focuses on familiar and not-so-familiar tools you can use today
for moving and integrating Big Data. It also outlines the technologies and platform involved (an introduction to Big Data, Hadoop, HDInsight and tools).

We will compare and outline options,
discuss how they can work with your existing Hadoop and Windows Azure
environment, and provide some guidance on when and how to use each of these.

  1. On the move with Big Data: Hadoop, Pig, Sqoop, SSIS… Stéphane Fréchette, Thursday, February 13, 2014
  2. Who am I? My name is Stéphane Fréchette, SQL Server MVP. I’m a Database & Business Intelligence Professional and Founder | CEO. I have a passion for architecting, designing and building solutions that matter. Self-proclaimed Open Data Hacker/Advocate, I founded Gatineau Ouverte, a citizen-led initiative that aims to promote open access to the civic data of the city of Gatineau. Twitter: @sfrechette
  3. Session Outline • What is Big Data? • Apache Hadoop • Hadoop Ecosystem • Windows Azure HDInsight • On the move… • SSIS, Sqoop, Pig • Demos • Resources
  4. What is Big Data?
  5. Apache Hadoop • Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models • Designed to scale up from single servers to thousands of machines, each offering local computation and storage
  6. Hadoop Ecosystem • Core components: • HDFS (Hadoop Distributed File System) -> Storage • MapReduce -> Processing
  7. What is Pig? • Write complex MapReduce jobs using a simple scripting language (Pig Latin) • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs • Pig compiles Pig Latin scripts into MapReduce jobs on the fly
  8. What is Sqoop? • Command-line interface application to transfer bulk data between Hadoop and relational datastores
  9. What is Hive? • A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis • Provides an SQL-like language called HiveQL to query data • Integration between Hadoop and BI and visualization tools
  10. What is SSIS? • SQL Server Integration Services is a platform for data integration and workflow applications. A fast and flexible tool used for data extraction, transformation, and loading (ETL). • Contains a rich set of built-in tasks and transformations; tools for constructing packages… • Used to solve complex business problems
  11. Windows Azure HDInsight • HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud • Based on the Hortonworks Data Platform • Scalable, on-demand service
  12. Demos (let’s move some data…)
  13. Resources • Apache Projects (list with links) • Windows Azure HDInsight • HDInsight Tutorials and Guide • Hortonworks Sandbox 2.0 • Hortonworks Tutorial Gallery • Microsoft JDBC Driver 4.0 for SQL Server • Microsoft Hive ODBC Driver • GitHub: WindowsAzure / azure-content • SSIS Custom Task – Disorderly Data (Ken Ross) • GitHub
  14. What Questions Do You Have?
  15. Thank you for attending this session
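
The slides describe Sqoop as a command-line tool for moving bulk data between Hadoop and relational datastores, but the transcript does not include the demo commands. As a rough sketch only (not the speaker's demo), the Python helper below assembles a `sqoop import` command that would copy one SQL Server table into HDFS. The helper name `build_sqoop_import` and every host, database, table, user and path value are hypothetical placeholders; the sketch assumes Sqoop 1 with the Microsoft JDBC driver available on the cluster.

```python
import shlex


def build_sqoop_import(host, database, table, username, password_file,
                       target_dir, num_mappers=4, port=1433):
    """Return the argv list for a `sqoop import` from SQL Server into HDFS."""
    # JDBC URL in the format used by the Microsoft JDBC Driver for SQL Server.
    jdbc_url = f"jdbc:sqlserver://{host}:{port};databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # keeps the password off the command line
        "--table", table,                   # source table in the relational database
        "--target-dir", target_dir,         # HDFS directory that receives the rows
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    argv = build_sqoop_import(
        host="sqlserver.example.com",
        database="SalesDB",
        table="Orders",
        username="etl_user",
        password_file="/user/etl/.sqlserver-password",
        target_dir="/data/staging/orders",
    )
    # Print a shell-safe command line (the ';' in the JDBC URL gets quoted).
    print(shlex.join(argv))
```

Building the command as a list rather than a single string keeps each argument separate, so it could be handed directly to `subprocess.run` on a machine where Sqoop is installed.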