DATA ENGINEER CERTIFICATION
Austin Sun
June 27th 2017
‱ Introduction
‱ Skill set
‱ How to prepare
‱ Register test
‱ References, etc.
OUTLINE
‱ Data Scientist:
a person employed to analyze and interpret complex digital data,
such as the usage statistics of a website, especially in order to assist
a business in its decision-making.
‱ Data Engineer:
a worker whose primary job responsibilities involve preparing data for
analytical or operational uses. Data engineers enable data scientists
to do their jobs more effectively.
DATA ENGINEER & DATA SCIENTIST
“An experienced open-source developer who earns
the Cloudera Certified Data Engineer credential is
able to perform core competencies required to
ingest, transform, store, and analyze data in
Cloudera's CDH environment. The credential is
earned after successfully passing the CCP Data
Engineer Exam (DE575).” -- Cloudera
CCP DATA ENGINEER
‱ Introduction
‱ Skill set
‱ How to prepare
‱ Register test
‱ References, etc.
OUTLINE
‱ Data Ingest
‱ Transform, Stage, Store
‱ Data Analysis
‱ Workflow
SKILL SET
The skills to transfer data between external systems and
your cluster. This includes the following:
‱ Import and export data between an external RDBMS and your cluster, including
the ability to import specific subsets, change the delimiter and file format of
imported data during ingest, and alter the data access pattern or privileges.
‱ Ingest real-time and near-real time (NRT) streaming data into HDFS, including the
ability to distribute to multiple data sources and convert data on ingest from one
format to another.
‱ Load data into and out of HDFS using the Hadoop File System (FS) commands.
DATA INGEST
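In practice, the RDBMS transfers are done with Sqoop and the basic file movement with the HDFS FS commands. A minimal sketch, where the connection string, credentials, tables, and paths are all placeholders:

```shell
# Import a subset of one table from MySQL into HDFS,
# changing the delimiter on ingest (all names are placeholders).
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --where "order_date >= '2017-01-01'" \
  --fields-terminated-by '\t' \
  --target-dir /user/austin/orders_text
# Use --as-avrodatafile or --as-parquetfile instead to change
# the file format during ingest.

# Export results back to the RDBMS.
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table order_summary \
  --export-dir /user/austin/order_summary

# Plain HDFS file movement with the FS commands.
hdfs dfs -put local_file.csv /user/austin/raw/
hdfs dfs -get /user/austin/orders_text ./orders_text
```

For the real-time/NRT bullet, Flume is the usual tool: its sources, channels, and sinks are wired together in a properties file rather than on the command line.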
Convert a set of data values in a given format stored in HDFS
into new data values and/or a new data format and write
them into HDFS or Hive/HCatalog. This includes the following
skills:
‱ Convert data from one file format to another
‱ Write your data with compression
‱ Convert data from one set of values to another (e.g., Lat/Long to Postal Address
using an external library)
‱ Change the data format of values in a data set
TRANSFORM, STAGE, STORE
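One common way to cover the first two bullets at once is a Hive CREATE TABLE AS SELECT that rewrites a text table as compressed Parquet. A sketch with hypothetical table names:

```shell
# Convert a delimited-text table to Snappy-compressed Parquet via Hive CTAS.
hive -e "
SET parquet.compression=SNAPPY;
CREATE TABLE orders_parquet
STORED AS PARQUET
AS SELECT * FROM orders_text;
"
```

The same pattern works in reverse (Parquet back to delimited text) or toward Avro with STORED AS AVRO.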
TRANSFORM, STAGE, STORE
‱ Purge bad records from a data set, e.g., null values
‱ De-duplicate and merge data
‱ De-normalize data from multiple disparate data sets
‱ Evolve an Avro or Parquet schema
‱ Partition an existing data set according to one or more partition keys
‱ Tune data for optimal query performance
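Several of these bullets can again be exercised in Hive, e.g. de-duplicating while writing into a partitioned table. Tables and columns here are hypothetical, and dynamic partitioning must be enabled first:

```shell
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- target table, partitioned by one key
CREATE TABLE orders_by_day (order_id BIGINT, amount DOUBLE)
PARTITIONED BY (order_date STRING)
STORED AS PARQUET;
-- dedupe on the way in; the partition column goes last in the SELECT
INSERT OVERWRITE TABLE orders_by_day PARTITION (order_date)
SELECT DISTINCT order_id, amount, order_date FROM orders_text;
"
```

Partitioning by a commonly filtered key is also the simplest lever for the query-performance bullet.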
Filter, sort, join, aggregate, and/or transform one or more data
sets in a given format stored in HDFS to produce a specified
result. All of these tasks may include reading from Parquet, Avro,
JSON, delimited text, and natural language text. The queries will
include complex data types (e.g., array, map, struct), the
implementation of external libraries, partitioned data,
compressed data, and require the use of metadata from
Hive/HCatalog.
DATA ANALYSIS
‱ Write a query to aggregate multiple rows of data
‱ Write a query to calculate aggregate statistics (e.g., average or sum)
‱ Write a query to filter data
‱ Write a query that produces ranked or sorted data
‱ Write a query that joins multiple data sets
‱ Read and/or create a Hive or an HCatalog table from existing data in HDFS
DATA ANALYSIS
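HiveQL sketches of the query types above, against a hypothetical orders/customers schema:

```shell
# Expose existing delimited files in HDFS as an external Hive table.
hive -e "
CREATE EXTERNAL TABLE orders
  (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/austin/orders_text';
"

hive -e "
-- aggregate statistics per group
SELECT customer_id, COUNT(*) AS n, AVG(amount) AS avg_amount
FROM orders
GROUP BY customer_id;

-- ranked output
SELECT order_id, amount, RANK() OVER (ORDER BY amount DESC) AS rnk
FROM orders;

-- join two data sets, filtered and sorted
SELECT o.order_id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE o.amount > 100
ORDER BY o.amount DESC;
"
```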
The ability to create and execute various jobs and actions that
move data towards greater value and use in a system. This
includes the following skills:
‱ Create and execute a linear workflow with actions that include Hadoop jobs, Hive
jobs, Pig jobs, custom actions, etc.
‱ Create and execute a branching workflow with actions that include Hadoop jobs,
Hive jobs, Pig jobs, custom actions, etc.
‱ Orchestrate a workflow to execute regularly at predefined times, including
workflows that have data dependencies
WORKFLOW
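Workflows on CDH are usually expressed in Oozie. A minimal linear workflow with a single Hive action, where the names, script, and property values are placeholders:

```xml
<!-- workflow.xml: start -> one Hive action -> end -->
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.4">
  <start to="transform"/>
  <action name="transform">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>transform.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Branching is added with decision/fork/join nodes, and time- or data-triggered scheduling comes from a separate Oozie coordinator definition.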
‱ Introduction
‱ Skill set
‱ How to prepare
‱ Register test
‱ References, etc.
OUTLINE
‱ Familiar with all related command tools
‱ Use the Cloudera QuickStart VM to practice
‱ Take sample test
‱ Hive, Impala, Sqoop, Spark, Crunch, Pig, Kite, Avro,
Parquet, Cloudera HUE, Oozie, Flume, DataFu, JDK 7
API Docs, Python 2.7, Python 3.4, Scala
Only the above documentation is accessible during the exam.
FAMILIAR WITH ALL RELATED
COMMAND TOOLS
DOWNLOAD QUICKSTART VM
SAMPLE EXAM QUESTION
‱ Introduction
‱ Skill set
‱ How to prepare
‱ Register test
‱ References, etc.
OUTLINE
‱ Create an account at www.examslocal.com.
‱ Select the exam
‱ Choose a date and time
‱ Select a time slot for the exam
‱ Run the compatibility check and install the
screen-sharing Chrome extension
STEPS TO SCHEDULE EXAM
‱ The exam is remote and takes less than two hours
‱ It is a partly open-book exam
‱ Some documentation is available online during
the exam
‱ All other websites, including Google/search
functionality, are disabled. No notes or other
exam aids are allowed.
DURING THE EXAM
‱ Introduction
‱ Skill set
‱ How to prepare
‱ Register test
‱ References, etc.
OUTLINE
‱ Official Apache & Cloudera documentation
‱ My website:
https://godataengineer.wordpress.com/
USEFUL LINKS
‱ Data Warehouse for Machine Learning app
‱ Using Flume, Hive, HDFS, Spark and Phoenix
USE CASE
THANK YOU
