Av João Paulo II, lote 5, loja 3, 1950-152 Lisboa
www.openup.pt
Boot Camp - Data Science using Cloudera
Revised January 2018
Bootcamp duration
● 10 days
Description
Data scientists build information platforms to ask and answer previously
unimaginable questions. Learn how data science helps organizations reduce costs,
increase efficiency, improve product delivery, improve customer and user
experience, and identify new opportunities. Our bootcamp helps participants
understand what data scientists do and the problems they solve, and prepares them
to become data scientists. Through in-class simulations, participants apply data
science methods to real-world challenges in different scenarios and, ultimately,
prepare for data scientist roles in the field.
This bootcamp is oriented to the different roles in the data science landscape:
administrators, developers, and data analysts.
This bootcamp delivers the key concepts and expertise participants need to ingest
and process data on a Hadoop cluster using the most up-to-date tools and
techniques. Employing Hadoop ecosystem projects such as Spark, Hive, Flume,
Sqoop, and Impala, this training course is the best preparation for the real-world
challenges faced by Hadoop developers. Participants learn to identify which tool is
the right one to use in a given situation, and gain hands-on experience
developing with those tools.
Participants will also learn how Apache Pig, Apache Hive, and Cloudera Impala let
them apply traditional data analytics and business intelligence skills to big data. Cloudera
presents the tools data professionals need to access, manipulate, transform, and
analyze complex data sets using SQL and familiar scripting languages.
Data visualisation is vital in bridging the gap between data and decisions. It is an
important method for communicating effectively and for analysing large datasets.
Through data visualisations we can draw conclusions that are not immediately
obvious from the raw data and interact with the data in an entirely different way.
This course provides an introduction to the methods, tools, and processes involved
in visualising big data.
Audience
• This course is suitable for system administrators, developers, data analysts,
and statisticians;
• In general, anyone interested in big data and data science.
Prerequisites
● Knowledge of an operating system such as Unix/Linux is preferable but not
essential;
● Knowledge of a programming language is preferable but not essential.
Objectives
On completing the bootcamp, participants will have learned:
● How to identify potential business use cases where data science can provide
impactful results;
● How to obtain, clean and combine disparate data sources to create a
coherent picture for analysis;
● What statistical methods to leverage for data exploration that will provide
critical insight into your data;
● Where and when to leverage Hadoop streaming and Apache Spark for data
science pipelines;
● What machine learning technique to use for a particular data science
project;
● How to implement and manage recommenders using Spark’s MLlib, and how
to set up and evaluate data experiments;
● What the pitfalls are of deploying new analytics projects to production, at
scale;
● How data is distributed, stored, and processed in a Hadoop cluster;
● How to use Sqoop and Flume to ingest data;
● How to process distributed data with Apache Spark;
● How to model structured data as tables in Impala and Hive;
● How to choose the best data storage format for different data usage
patterns;
● Best practices for data storage;
● The features that Pig, Hive, and Impala offer for data acquisition, storage,
and analysis;
● The fundamentals of Apache Hadoop and data ETL (extract, transform,
load), ingestion, and processing with Hadoop tools;
● How Pig, Hive, and Impala improve productivity for typical analysis tasks;
● How to join diverse datasets to gain valuable business insight;
● How to perform real-time, complex queries on datasets;
● How to use big data and data science visualization tools.
Course Outline:
Introduction
• About This Course
• About Cloudera
• Course Logistics
• Introductions
Data Science Overview
• What Is Data Science?
• The Growing Need for Data Science
• The Role of a Data Scientist
Introduction to Hadoop and the Hadoop Ecosystem
• Problems with Traditional Large-scale Systems
• Hadoop!
• The Hadoop Ecosystem
Hadoop Architecture and HDFS
• Distributed Processing on a Cluster
• Storage: HDFS Architecture
• Storage: Using HDFS
• Resource Management: YARN Architecture
• Resource Management: Working with YARN
Importing Relational Data with Apache Sqoop
• Sqoop Overview
• Basic Imports and Exports
• Limiting Results
• Improving Sqoop’s Performance
• Sqoop 2
Introduction to Impala and Hive
• Introduction to Impala and Hive
• Why Use Impala and Hive?
• Comparing Hive to Traditional Databases
• Hive Use Cases
Modeling and Managing Data with Impala and Hive
• Data Storage Overview
• Creating Databases and Tables
• Loading Data into Tables
• HCatalog
• Impala Metadata Caching
Data Formats
• Selecting a File Format
• Hadoop Tool Support for File Formats
• Avro Schemas
• Using Avro with Hive and Sqoop
• Avro Schema Evolution
• Compression
Data Partitioning
• Partitioning Overview
• Partitioning in Impala and Hive
Capturing Data with Apache Flume
• What is Apache Flume?
• Basic Flume Architecture
• Flume Sources
• Flume Sinks
• Flume Channels
• Flume Configuration
Spark Basics
• What is Apache Spark?
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark
Working with RDDs in Spark
• A Closer Look at RDDs
• Key-Value Pair RDDs
• MapReduce
• Other Pair RDD Operations
Writing and Deploying Spark Applications
• Spark Applications vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Configuring Spark Properties
• Logging
Parallel Programming with Spark
• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File-based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
Spark Caching and Persistence
• RDD Lineage
• Caching Overview
• Distributed Persistence
Common Patterns in Spark Data Processing
• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means
Preview: Spark SQL
• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• Comparing Spark SQL with Impala
Introduction to Pig
• What Is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig
Basic Data Analysis with Pig
• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-Used Functions
Processing Complex Data with Pig
• Storage Formats
• Complex/Nested Data Types
• Grouping
• Built-In Functions for Complex Data
• Iterating Grouped Data
Multi-Dataset Operations with Pig
• Techniques for Combining Data Sets
• Joining Data Sets in Pig
• Set Operations
• Splitting Data Sets
Pig Troubleshooting and Optimization
• Troubleshooting Pig
• Logging
• Using Hadoop’s Web UI
• Data Sampling and Debugging
• Performance Overview
• Understanding the Execution Plan
• Tips for Improving the Performance of Your Pig Jobs
Introduction to Hive and Impala
• What Is Hive?
• What Is Impala?
• Schema and Data Storage
• Comparing Hive to Traditional Databases
• Hive Use Cases
Querying with Hive and Impala
• Databases and Tables
• Basic Hive and Impala Query Language Syntax
• Data Types
• Differences Between Hive and Impala Query Syntax
• Using Hue to Execute Queries
• Using the Impala Shell
Data Management
• Data Storage
• Creating Databases and Tables
• Loading Data
• Altering Databases and Tables
• Simplifying Queries with Views
• Storing Query Results
Data Storage and Performance
• Partitioning Tables
• Choosing a File Format
• Managing Metadata
• Controlling Access to Data
Relational Data Analysis with Hive and Impala
• Joining Datasets
• Common Built-In Functions
• Aggregation and Windowing
Working with Impala
• How Impala Executes Queries
• Extending Impala with User-Defined Functions
• Improving Impala Performance
Analyzing Text and Complex Data with Hive
• Complex Values in Hive
• Using Regular Expressions in Hive
• Sentiment Analysis and N-Grams
• Conclusion
Hive Optimization
• Understanding Query Performance
• Controlling Job Execution Plan
• Bucketing
• Indexing Data
Extending Hive
• SerDes
• Data Transformation with Custom Scripts
• User-Defined Functions
• Parameterized Queries
Choosing the Best Tool for the Job
• Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
• Which to Choose?
Visualization Tools
• Trying Different Data Visualization Tools
• Methods, Tools, and Processes Involved
• Choosing the Best Visualization Tool for the Job
Conclusion

More Related Content

What's hot

Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
Sonal Tiwari
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
Edureka!
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
ShivanandaVSeeri
 
ODI11g, Hadoop and "Big Data" Sources
ODI11g, Hadoop and "Big Data" SourcesODI11g, Hadoop and "Big Data" Sources
ODI11g, Hadoop and "Big Data" Sources
Mark Rittman
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Amir Shaikh
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
Md. Hasan Basri (Angel)
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
kammeyer
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
Tom Rogers
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
Maryan Faryna
 
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseBDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
David Lauzon
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Big Data Spain
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
Deborah Akuoko
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
Edureka!
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
Dzung Nguyen
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
 

What's hot (20)

Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
ODI11g, Hadoop and "Big Data" Sources
ODI11g, Hadoop and "Big Data" SourcesODI11g, Hadoop and "Big Data" Sources
ODI11g, Hadoop and "Big Data" Sources
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Hadoop Eco system
Hadoop Eco systemHadoop Eco system
Hadoop Eco system
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
 
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseBDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

Similar to Bootcamp Data Science using Cloudera

A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Impala use case @ edge
Impala use case @ edgeImpala use case @ edge
Impala use case @ edge
Ram Kedem
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
SCAPE Project
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?
AkhmadZakiAlsafi
 
Hadoop_Architect__eVenkat
Hadoop_Architect__eVenkatHadoop_Architect__eVenkat
Hadoop_Architect__eVenkatVenkat Krishnan
 
Big data analytics_using_hadoop
Big data analytics_using_hadoopBig data analytics_using_hadoop
Big data analytics_using_hadoop
Knowledgehut
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Databricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
DataWorks Summit
 
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's ToolkitUsing Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Mark Rittman
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analytics
Kyle Bader
 
Hadoop online training in india
Hadoop online training  in indiaHadoop online training  in india
Hadoop online training in indiaMadhu Trainer
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
neeraj rathore
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San Jose
Jeffrey T. Pollock
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
SCAPE Project
 

Similar to Bootcamp Data Science using Cloudera (20)

A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Impala use case @ edge
Impala use case @ edgeImpala use case @ edge
Impala use case @ edge
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?
 
Hadoop_Architect__eVenkat
Hadoop_Architect__eVenkatHadoop_Architect__eVenkat
Hadoop_Architect__eVenkat
 
Big data analytics_using_hadoop
Big data analytics_using_hadoopBig data analytics_using_hadoop
Big data analytics_using_hadoop
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's ToolkitUsing Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analytics
 
Hadoop online training in india
Hadoop online training  in indiaHadoop online training  in india
Hadoop online training in india
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San Jose
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
DevOps and Testing slides at DASA Connect

Bootcamp Data Science using Cloudera

  • 1.
Av João Paulo II, lote 5, loja 3, 1950-152 Lisboa
www.openup.pt
Boot Camp - Data Science using Cloudera
Revised January 2018
Bootcamp duration
● 10 days
Description
Data scientists build information platforms to ask and answer previously unimaginable questions. Learn how data science helps organizations reduce costs, increase efficiency, improve product delivery, improve the customer and user experience, and identify new opportunities. Our bootcamp helps participants understand what data scientists do and the problems they solve, and prepares them to become data scientists. Through in-class simulations, participants apply data science methods to real-world challenges in different scenarios and, ultimately, prepare for data scientist roles in the field.
This bootcamp is oriented to the different roles in the data science landscape: administrators, developers, and data analysts.
This bootcamp delivers the key concepts and expertise participants need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques. Employing Hadoop ecosystem projects such as Spark, Hive, Flume, Sqoop, and Impala, this training course is the best preparation for the real-world challenges faced by Hadoop developers. Participants learn to identify which tool is the right one to use in a given situation, and gain hands-on experience developing with those tools.
Participants will also learn how Apache Pig, Apache Hive, and Cloudera Impala let them apply traditional data analytics and business intelligence skills to big data. Cloudera presents the tools data professionals need to access, manipulate, transform, and analyze complex data sets using SQL and familiar scripting languages.
Data visualisation is vital in bridging the gap between data and decisions. It is an important method for communicating effectively and for analysing large datasets. Through data visualisations we can draw conclusions that are not always immediately obvious and interact with the data in an entirely different way. This course provides an introduction to the methods, tools and processes involved in visualising big data.
  • 2.
Audience
• This course is suitable for system administrators, developers, data analysts, and statisticians;
• More generally, it is suitable for anyone interested in big data and data science.
Prerequisites
● Knowledge of an operating system such as Unix/Linux is preferable but not essential;
● Knowledge of a programming language is preferable but not essential.
Objectives
After completing the bootcamp, participants will have learned:
● How to identify potential business use cases where data science can provide impactful results;
● How to obtain, clean and combine disparate data sources to create a coherent picture for analysis;
● What statistical methods to leverage for data exploration that will provide critical insight into your data;
● Where and when to leverage Hadoop streaming and Apache Spark for data science pipelines;
● What machine learning technique to use for a particular data science project;
● How to implement and manage recommenders using Spark’s MLlib, and how to set up and evaluate data experiments;
● What the pitfalls are of deploying new analytics projects to production, at scale;
  • 3.
● How data is distributed, stored, and processed in a Hadoop cluster;
● How to use Sqoop and Flume to ingest data;
● How to process distributed data with Apache Spark;
● How to model structured data as tables in Impala and Hive;
● How to choose the best data storage format for different data usage patterns;
● Best practices for data storage;
● The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis;
● The fundamentals of Apache Hadoop and data ETL (extract, transform, load), ingestion, and processing with Hadoop tools;
● How Pig, Hive, and Impala improve productivity for typical analysis tasks;
● How to join diverse datasets to gain valuable business insight;
● How to perform real-time, complex queries on datasets;
● How to use big data and data science visualization tools.
  • 4.
Course Outline:

Introduction
• About This Course
• About Cloudera
• Course Logistics
• Introductions

Data Science Overview
• What Is Data Science?
• The Growing Need for Data Science
• The Role of a Data Scientist

Introduction to Hadoop and the Hadoop Ecosystem
• Problems with Traditional Large-scale Systems
• Hadoop!
• The Hadoop Ecosystem

Hadoop Architecture and HDFS
• Distributed Processing on a Cluster
• Storage: HDFS Architecture
• Storage: Using HDFS
• Resource Management: YARN Architecture
• Resource Management: Working with YARN

Importing Relational Data with Apache Sqoop
• Sqoop Overview
• Basic Imports and Exports
• Limiting Results
• Improving Sqoop’s Performance
• Sqoop 2

Introduction to Impala and Hive
• Introduction to Impala and Hive
• Why Use Impala and Hive?
• Comparing Hive to Traditional Databases
• Hive Use Cases

Modeling and Managing Data with Impala and Hive
• Data Storage Overview
• Creating Databases and Tables
• Loading Data into Tables
• HCatalog
  • 5.
• Impala Metadata Caching

Data Formats
• Selecting a File Format
• Hadoop Tool Support for File Formats
• Avro Schemas
• Using Avro with Hive and Sqoop
• Avro Schema Evolution
• Compression

Data Partitioning
• Partitioning Overview
• Partitioning in Impala and Hive

Capturing Data with Apache Flume
• What is Apache Flume?
• Basic Flume Architecture
• Flume Sources
• Flume Sinks
• Flume Channels
• Flume Configuration

Spark Basics
• What is Apache Spark?
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark

Working with RDDs in Spark
• A Closer Look at RDDs
• Key-Value Pair RDDs
• MapReduce
• Other Pair RDD Operations

Writing and Deploying Spark Applications
• Spark Applications vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Configuring Spark Properties
• Logging

Parallel Programming with Spark
• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File-based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
  • 6.
Spark Caching and Persistence
• RDD Lineage
• Caching Overview
• Distributed Persistence

Common Patterns in Spark Data Processing
• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means

Preview: Spark SQL
• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• Comparing Spark SQL with Impala

Introduction to Pig
• What Is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig

Basic Data Analysis with Pig
• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-Used Functions

Processing Complex Data with Pig
• Storage Formats
• Complex/Nested Data Types
• Grouping
• Built-In Functions for Complex Data
• Iterating Grouped Data

Multi-Dataset Operations with Pig
• Techniques for Combining Data Sets
• Joining Data Sets in Pig
• Set Operations
• Splitting Data Sets

Pig Troubleshooting and Optimization
• Troubleshooting Pig
  • 7.
• Logging
• Using Hadoop’s Web UI
• Data Sampling and Debugging
• Performance Overview
• Understanding the Execution Plan
• Tips for Improving the Performance of Your Pig Jobs

Introduction to Hive and Impala
• What Is Hive?
• What Is Impala?
• Schema and Data Storage
• Comparing Hive to Traditional Databases
• Hive Use Cases

Querying with Hive and Impala
• Databases and Tables
• Basic Hive and Impala Query Language Syntax
• Data Types
• Differences Between Hive and Impala Query Syntax
• Using Hue to Execute Queries
• Using the Impala Shell

Data Management
• Data Storage
• Creating Databases and Tables
• Loading Data
• Altering Databases and Tables
• Simplifying Queries with Views
• Storing Query Results

Data Storage and Performance
• Partitioning Tables
• Choosing a File Format
• Managing Metadata
• Controlling Access to Data

Relational Data Analysis with Hive and Impala
• Joining Datasets
• Common Built-In Functions
• Aggregation and Windowing

Working with Impala
• How Impala Executes Queries
• Extending Impala with User-Defined Functions
• Improving Impala Performance

Analyzing Text and Complex Data with Hive
• Complex Values in Hive
• Using Regular Expressions in Hive
• Sentiment Analysis and N-Grams
• Conclusion
  • 8.
Hive Optimization
• Understanding Query Performance
• Controlling Job Execution Plan
• Bucketing
• Indexing Data

Extending Hive
• SerDes
• Data Transformation with Custom Scripts
• User-Defined Functions
• Parameterized Queries

Choosing the Best Tool for the Job
• Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
• Which to Choose?

Visualization Tools
• Try different data visualization tools
• Discover the methods, tools and processes involved
• Choosing the Best Visualization Tool for the Job

Conclusion