Document by Bash Badawi, December 30, 2016
Please feel free to share; I kindly ask only that you
reference the source. Email me if you need further
documentation or have questions or suggestions. Twitter:
@bashbadawi, LinkedIn Profile, My 4-part Big Data
Articles on LinkedIn comparing Vendors, Stacks, etc.,
and Blog on WordPress. Some of the content is lifted
from various sources, all of them verifiable Data Scientists.
Unfortunately, I do not have the references to include
in this document. If you are a content provider I
used, please email me so I can credit you in the document.
Use the Table of Contents to easily navigate to the
desired resources.
About Me: A Computer Science/Math Graduate with a
Recent Master’s Degree in Business/Software Economics
and a veteran of the IT industry of over 20 years.
Contents
Hadoop Training Resources
Machine Learning Resources
Big Data Lambda Architecture
The 40 data science techniques
Data Science - DSC Resources From Analytics Bridge
  Additional Reading
4 Ways to Spot a Fake Data Scientist
  Unstructured Data Definition
Resources
You’re Not a Data Scientist
Skills needed to be a Data Scientist
  Technical Skills: Analytics
  Technical Skills: Computer Science
  Non-Technical Skills
My Data Science profile which you might want to use in your resume
Microsoft Big Data Market Play – HDInsight
  HDInsight on Linux (Preview)
  HDInsight on Windows
  Apache Hadoop
  SQL Database on Azure
  Microsoft Business Intelligence (for HDInsight on Windows)
    Connect Excel to Hadoop with Power Query
    Connect Excel to Hadoop with the Microsoft Hive ODBC Driver
    Microsoft Cloud Platform
    Learn about SQL Server Reporting Services
  Try HDInsight solutions for big-data analysis (for HDInsight on Windows)
    Analyze HVAC sensor data
    Use Hive with HDInsight to analyze website logs
    Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop)
HDInsight HBase overview MSDN
  What is HDInsight HBase in Azure?
  How is data managed in HDInsight HBase?
  Scenarios: What are the use cases for HBase?
  Next steps
Get started with Apache HBase in HDInsight
  Prerequisites
  Provision an HBase cluster
  To provision an HBase cluster by using the Azure portal
Hadoop Training Resources
1. http://www.youtube.com/playlist?list=PLF82F6499E89E1BAE
2. Someone started a website for the Hadoop Ecosystem. http://hadoopecosystem.whatazoo.com/.
http://hadoopecosystem.whatazoo.com/home/training
3. https://www.linkedin.com/redirect?url=http%3A%2F%2Fsatya-hadoop%2Eblogspot%2Ecom%2F2013%2F03%2Fhadoop-training-institutes-in-india%2Ehtml&urlhash=sJuS&_t=tracking_disc
4. http://university.cloudera.com/training/apache_hive_and_pig/hive_and_pig.html
5. http://www.linalis.com/en/training/planning
6. https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
7. http://cloudwick.com/training/
8. http://www.learningtree.com/courses/1250/introduction-to-big-data/
9. www.bisptrainings.com
10. http://www.udemy.com
11. (http://catechnologies.in/big-data.html).
12. http://www.mapr.com/academy/
13. By the way, DatumFora also offers live online instructor-led Hadoop courses. Check it out
at http://www.datumfora.com/#!online-hadoop-course-oct-26-27/c137j and save 20% when registering with
promo code LNKD20.
14. http://www.datumfora.com/#!2-day-hadoop-class-oct-19-20/cf4u
15. http://www.ambaricloud.com/
16. http://www.datumfora.com/#!upcoming-classes/ct0e
17. http://www.mapr.com/products/download
18. http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support?action=show&redirect=Distribution
19. http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/
20. http://hortonworks.com/hdp/downloads/
21. Try the tutorial at http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ and read more
about Spark GA on HDP at http://hortonworks.com/blog/announcing-apache-spark-now-ga-on-hortonworks-data-platform/
22. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Machine Learning Resources
Big Data Lambda Architecture
Posted on September 5, 2012 by dbtube
In order to meet the challenges of Big Data, you must rethink data systems from the ground up. You will discover that
some of the most basic ways people manage data in traditional systems like the relational database management
system (RDBMS) are too complex for Big Data systems. The simpler, alternative approach is a new paradigm for Big Data.
In this article based on chapter 1, author Nathan Marz shows you this approach, which he has dubbed the “lambda
architecture.”
This article is based on Big Data, to be published in Fall 2012. This eBook is available through the Manning Early Access
Program (MEAP). Download the eBook instantly from manning.com. All print book purchases include free digital formats
(PDF, ePub and Kindle). Visit the book’s page for more information about Big Data. This content is being reproduced
here by permission from Manning Publications.
Author: Nathan Marz
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that
provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data
system.
The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by
decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
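As a rough illustration of the three layers named above, the following Python sketch keeps a batch view recomputed from all stored events, a small real-time view for events that arrived after the last batch run, and a query function that merges the two. The page-view counting scenario and every name in the code are assumptions made for this example; they are not taken from the book.

```python
from collections import Counter

# Hypothetical master dataset: an append-only list of page-view events (URLs).
master_dataset = ["/home", "/about", "/home"]


def run_batch_layer(events):
    """Batch layer: recompute the view from scratch over all stored events."""
    return Counter(events)


class SpeedLayer:
    """Speed layer: counts only events that no batch run has absorbed yet."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, url):
        self.realtime_view[url] += 1

    def reset(self):
        # Called once a batch run has absorbed the recent events.
        self.realtime_view.clear()


def query(batch_view, realtime_view, url):
    """Serving-side query: merge the batch view with the real-time view."""
    return batch_view[url] + realtime_view[url]


# The serving layer would index this precomputed result.
batch_view = run_batch_layer(master_dataset)
speed = SpeedLayer()

# New events arrive after the batch run: store them and update the speed layer.
for url in ["/home", "/pricing"]:
    master_dataset.append(url)
    speed.on_event(url)

home_views = query(batch_view, speed.realtime_view, "/home")
pricing_views = query(batch_view, speed.realtime_view, "/pricing")

# The next batch run absorbs the recent events, so the speed layer is reset.
batch_view = run_batch_layer(master_dataset)
speed.reset()
home_views_after_batch = query(batch_view, speed.realtime_view, "/home")
```

The point of the split is that the batch layer can stay simple (a full recomputation), while the speed layer only has to cover the short window since the last batch run.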
The 40 data science techniques
1. Linear Regression
2. Logistic Regression
3. Jackknife Regression *
4. Density Estimation
5. Confidence Interval
6. Test of Hypotheses
7. Pattern Recognition
8. Clustering - (aka Unsupervised Learning)
9. Supervised Learning
10. Time Series
11. Decision Trees
12. Random Numbers
13. Monte-Carlo Simulation
14. Bayesian Statistics
15. Naive Bayes
16. Principal Component Analysis - (PCA)
17. Ensembles
18. Neural Networks
19. Support Vector Machine - (SVM)
20. Nearest Neighbors - (k-NN)
21. Feature Selection - (aka Variable Reduction)
22. Indexation / Cataloguing *
23. (Geo-) Spatial Modeling
24. Recommendation Engine *
25. Search Engine *
26. Attribution Modeling *
27. Collaborative Filtering *
28. Rule System
29. Linkage Analysis
30. Association Rules
31. Scoring Engine
32. Segmentation
33. Predictive Modeling
34. Graphs
35. Deep Learning
36. Game Theory
37. Imputation
38. Survival Analysis
39. Arbitrage
40. Lift Modeling
41. Yield Optimization
42. Cross-Validation
43. Model Fitting
44. Relevancy Algorithm *
45. Experimental Design
The number of techniques is higher than 40 because we updated the article and added more.
Data Science - DSC Resources From Analytics Bridge
• Career: Training | Books | Cheat Sheet | Apprenticeship | Certification | Salary Surveys | Jobs
• Knowledge: Research | Competitions | Webinars | Our Book | Members Only | Search DSC
• Buzz: Business News | Announcements | Events | RSS Feeds
• Misc: Top Links | Code Snippets | External Resources | Best Blogs | Subscribe | For Bloggers
Additional Reading
• What statisticians think about data scientists
• Data Science Compared to 16 Analytic Disciplines
• 10 types of data scientists
• 91 job interview questions for data scientists
• 50 Questions to Test True Data Science Knowledge
• 24 Uses of Statistical Modeling
• 21 data science systems used by Amazon to operate its business
• Top 20 Big Data Experts to Follow (Includes Scoring Algorithm)
• 5 Data Science Leaders Share their Predictions for 2016 and Beyond
• 50 Articles about Hadoop and Related Topics
• 10 Modern Statistical Concepts Discovered by Data Scientists
• Top data science keywords on DSC
• 4 easy steps to becoming a data scientist
• 22 tips for better data science
• How to detect spurious correlations, and how to find the real ones
• 17 short tutorials all data scientists should read (and practice)
• High versus low-level data science
Reference: @DataScienceCtrl | @AnalyticBridge
4 Ways to Spot a Fake Data Scientist
I’m here to tell you that from all of my conversations with data scientists and “data scientists” I’ve discovered four
telltale signs that a professional is not a true data scientist:
1. Lack of a highly quantitative advanced degree – It’s incredibly rare for someone without an advanced
quantitative degree to have the technical skills necessary to be a data scientist. In our data science salary
report we found that 88% of data scientists have at least a Master’s degree, and 46% have a Ph.D. The areas
of study may vary, but the vast majority are very rigorous quantitative, technical, or scientific programs,
including Math, Statistics, Computer Science, Engineering, Economics, and Operations Research.
2. No concrete examples of experience with unstructured data – Lists of tools such as Hadoop, Python, and AWS
need to be accompanied by projects that show those skills being put to good use. If a professional cannot
provide clear examples of their experience with unstructured data, or mentions data science projects, but
keeps their involvement very vague, then they are probably not a data scientist. If their specific role in or impact
on a Big Data project is unclear, that is cause for concern.
3. Purely academic or research background – Now, this is not to say that someone with a stellar academic or
research background won’t make a great data scientist, but a key component to being a data scientist in a
corporate setting is business acumen. Understanding how findings affect business goals and delivering
actionable insights to leaders is critical to a data scientist’s success. Many research academics have exceptional
data skills, but without strong business savvy they are not data scientists… yet.
4. List of basic business skills – If I see a list of tools on a “data scientist” resume like Omniture, Google Analytics,
SPSS, Excel, or any other Microsoft Office tool, you can be sure that I will take a harder look at whether or not
this professional makes the grade. These skills are basic business qualifications that are insufficient for most data
science positions, and by themselves are not indicative of a true data scientist.
Unstructured Data Definition
Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined
data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but
may contain data such as dates, numbers, and facts as well.
Resources
1. Advanced Degree – More Data Science programs are popping up to serve the current demand, but there
are also many Mathematics, Statistics, and Computer Science programs.
2. MOOCs – Coursera, Udacity, and Codecademy are good places to start.
3. Certifications – KDnuggets has compiled an extensive list.
4. Bootcamps – For more information about how this approach compares to degree programs or MOOCs, check
out this guest blog from the data scientists at Datascope Analytics.
5. Kaggle – Kaggle hosts data science competitions where you can practice, hone your skills with messy, real
world data, and tackle actual business problems. Employers take Kaggle rankings seriously, as they can be seen
as relevant, hands-on project work.
6. LinkedIn Groups – Join relevant groups to interact with other members of the data science community.
7. Data Science Central and KDnuggets – Data Science Central and KDnuggets are good resources for staying
at the forefront of industry trends in data science.
8. The Burtch Works Study: Salaries of Data Scientists – If you’re looking for more information about the salaries
and demographics of current data scientists be sure to download our data scientist salary study.
You’re Not a Data Scientist
The IT biz has historically rebranded job titles based upon what’s trending: today’s Software Architects were once
known as Designers or Systems Engineers. Nothing is trending faster and louder than predictive analytics, machine
learning, deep learning and AI. So it’s our turn to rebrand data geeks as data scientists. Now don’t get me wrong: some
of these folks are legit Data Scientists, but the majority are not. I guess I’m a purist: calling yourself a scientist indicates
that you practice science following a scientific method. You create hypotheses, test them with experimental
results, and after proving or disproving the conjecture, move on or iterate.
Skills needed to be a Data Scientist
Technical Skills: Analytics
1. Education – Data scientists are highly educated – 88% have at least a Master’s degree and 46% have PhDs – and
while there are notable exceptions, a very strong educational background is usually required to develop the
depth of knowledge necessary to be a data scientist. Their most common fields of study are Mathematics and
Statistics (32%), followed by Computer Science (19%) and Engineering (16%).
2. SAS and/or R – In-depth knowledge of at least one of these analytical tools. For data science, R is generally
preferred.
Technical Skills: Computer Science
3. Python Coding – Python is the most common coding language I typically see required in data science roles, along
with Java, Perl, or C/C++.
4. Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having
experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can
also be beneficial.
5. SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science, it is
still expected that a candidate will be able to write and execute complex queries in SQL.
6. Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is from
social media, video feeds or audio.
Non-Technical Skills
7. Intellectual curiosity – No doubt you’ve seen this phrase everywhere lately, especially as it relates to data
scientists. Frank Lo describes what it means, and talks about other necessary “soft skills” in his guest
blog posted a few months ago.
8. Business acumen – To be a data scientist you’ll need a solid understanding of the industry you’re working in,
and know what business problems your company is trying to solve. In terms of data science, being able to
discern which problems are important to solve for the business is critical, in addition to identifying new ways the
business should be leveraging its data.
9. Communication skills – Companies searching for a strong data scientist are looking for someone who can clearly
and fluently translate their technical findings to a non-technical team, such as the Marketing or Sales
departments. A data scientist must enable the business to make decisions by arming them with quantified
insights, in addition to understanding the needs of their non-technical colleagues in order to wrangle the data
appropriately. Check out our recent flash survey for more information on communication skills for
quantitative professionals.
My Data Science profile which you might want to use in your resume
Microsoft Big Data Market Play – HDInsight
I highly recommend HDInsight for Windows developers who do not work on Linux.
Machine Learning on Azure abstracts away a lot of the Big Data
complexity and allows you to jump straight to the final analysis levels; that is, what takes 6-7
steps in Hadoop takes 2 steps in HDInsight.
HDInsight on Linux (Preview)
• Get started with HDInsight on Linux - A quick-start tutorial for provisioning HDInsight Hadoop clusters on
Linux and running sample Hive queries.
• Provision HDInsight on Linux using custom options - Learn how to provision an HDInsight Hadoop cluster on
Linux by using custom options through the Azure Management Portal, Azure cross-platform command line,
or Azure
• Working with HDInsight on Linux - Get some quick tips on working with Hadoop Linux clusters provisioned
on Azure.
• Manage HDInsight clusters using Ambari - Learn how to monitor and manage your Linux-based Hadoop on
HDInsight cluster by using Ambari Web, or the Ambari REST API.
HDInsight on Windows
• HDInsight documentation - The documentation page for Azure HDInsight with links to articles, videos, and
more resources.
• Learning map for HDInsight - A guided tour of Hadoop documentation for HDInsight.
• Get started with Azure HDInsight - A quick-start tutorial for using Hadoop in HDInsight.
• Run the HDInsight samples - A tutorial on how to run the samples that ship with HDInsight.
• Azure HDInsight SDK - Reference documentation for the HDInsight SDK.
Apache Hadoop
• Apache Hadoop - Learn more about the Apache Hadoop software library, a framework that allows for the
distributed processing of large datasets across clusters of computers.
• HDFS - Learn more about the architecture and design of the Hadoop Distributed File System, the primary
storage system used by Hadoop applications.
• MapReduce Tutorial - Learn more about the programming framework for writing Hadoop applications
that rapidly process large amounts of data in parallel on large clusters of compute nodes.
SQL Database on Azure
• Azure SQL Database - MSDN documentation for SQL Database.
• Management Portal for SQL Database - A lightweight and easy-to-use database management tool for managing
SQL Database in the cloud.
• Adventure Works for SQL Database - Download page for a SQL Database sample database.
Microsoft Business Intelligence (for HDInsight on Windows)
Familiar business intelligence (BI) tools - such as Excel, PowerPivot, SQL Server Analysis Services, and SQL Server
Reporting Services - retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in
or the Microsoft Hive ODBC Driver.
These BI tools can help in your big-data analysis:
Connect Excel to Hadoop with Power Query
• Learn how to connect Excel to the Azure Storage account that stores the data associated with your HDInsight
cluster by using Microsoft Power Query for Excel.
Connect Excel to Hadoop with the Microsoft Hive ODBC Driver
• Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver.
Microsoft Cloud Platform
• Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL
Server BI.
• Learn more about SQL Server Analysis Services.
Learn about SQL Server Reporting Services
Try HDInsight solutions for big-data analysis (for HDInsight on Windows)
Analyze data from your organization to gain insights into your business. Here are some examples:
Analyze HVAC sensor data
Learn how to analyze sensor data by using Hive with HDInsight (Hadoop), and then visualize the data in Microsoft Excel.
In this sample, you'll use Hive to process historical data produced by HVAC systems to see which systems can't reliably
maintain a set temperature.
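The tutorial itself uses Hive. Purely to illustrate the kind of aggregation involved, here is a small Python sketch. The record layout (system ID, target temperature, actual temperature), the sample values, and the 5-degree tolerance are all assumptions made for this example, not details taken from the tutorial.

```python
from collections import defaultdict

# Hypothetical readings: (system_id, target_temp, actual_temp).
readings = [
    ("sys-1", 70, 71),
    ("sys-1", 70, 69),
    ("sys-2", 68, 75),
    ("sys-2", 68, 77),
    ("sys-3", 72, 72),
]

TOLERANCE = 5  # assumed threshold, in degrees


def unreliable_systems(rows, tolerance=TOLERANCE):
    """Return IDs of systems whose mean absolute deviation exceeds tolerance."""
    deviations = defaultdict(list)
    for system_id, target, actual in rows:
        deviations[system_id].append(abs(actual - target))
    return sorted(
        system_id
        for system_id, values in deviations.items()
        if sum(values) / len(values) > tolerance
    )


flagged = unreliable_systems(readings)
```

In this made-up sample only `sys-2` drifts far enough from its target to be flagged.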
Use Hive with HDInsight to analyze website logs
Learn how to use HiveQL in HDInsight to analyze website logs to get insight into the frequency of visits in a day from
external websites, and a summary of website errors that the users experience.
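The two summaries described above (visits per day from external sites, and a breakdown of errors) are simple group-and-count operations. The sketch below shows the idea in Python rather than HiveQL; the log record layout and the sample hosts are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical log records: (date, referrer_host, http_status).
# A referrer of "-" is assumed to mean "no external referrer".
log_records = [
    ("2014-01-01", "search.example.com", 200),
    ("2014-01-01", "search.example.com", 404),
    ("2014-01-01", "-", 200),
    ("2014-01-02", "news.example.org", 500),
]

# Visits per (day, external referrer).
visits_per_day = Counter(
    (date, referrer)
    for date, referrer, _status in log_records
    if referrer != "-"
)

# Error summary: count of responses per HTTP status code >= 400.
errors_by_status = Counter(
    status for _date, _referrer, status in log_records if status >= 400
)
```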
Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop)
Learn how to build a solution that uses a Storm cluster in HDInsight to process sensor data from Azure Event Hubs, and
then displays the processed sensor data as near-real-time information on a web-based dashboard.
To try Hadoop on HDInsight, see "Get started" articles in the Explore section on the HDInsight documentation page. To
try more advanced examples, scroll down to the Analyze section.
HDInsight HBase overview MSDN
HBase is an Apache, open-source, NoSQL database that is built on Hadoop. HBase provides random access and strong
consistency for large amounts of unstructured and semistructured data. It was modeled on Google's BigTable, and it is
a column-family-oriented database. Data is stored in the rows of a table, and data within a row is grouped by column
family. HBase is a schema-less database in the sense that neither the columns nor the type of data stored in them need
to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of
nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications
in the Hadoop ecosystem.
What is HDInsight HBase in Azure?
HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are
configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in
performance and cost choices. This enables customers to build interactive websites that work with large datasets, to
build services that store sensor and telemetry data from millions of end points, and to analyze this data with Hadoop
jobs. HBase and Hadoop are good starting points for big data projects in Azure; in particular, they can enable real-time
applications to work with large datasets.
The HDInsight implementation leverages the scale-out architecture of HBase to provide automatic sharding of tables,
strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory caching for
reads and high-throughput streaming for writes. Virtual network provisioning is also available for HDInsight HBase. For
details, see Provision HDInsight clusters on Azure Virtual Network.
How is data managed in HDInsight HBase?
Data can be managed in HBase by using the Create, Get, Put, and Scan commands from the HBase shell. Data is written
to the database by using put and read by using get. The scan command is used to obtain data from multiple rows in a
table. Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.
An HBase database can also be queried by using Hive. For an introduction to these programming models, see Get
started using HBase with Hadoop in HDInsight. Co-processors are also available, which allow data processing in the
nodes that host the database.
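To make the put, get, and scan operations and the column-family model more concrete, here is a toy in-memory model in Python. It is not the HBase shell, the C# API, or any real HBase client; the class `ToyTable`, its methods, and the sample data are hypothetical. It only mimics the behaviour described above: column families are fixed when the table is created, column qualifiers are not declared in advance, and a scan walks rows in row-key order.

```python
class ToyTable:
    """In-memory stand-in for an HBase table; not the real HBase client API."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = {}  # row_key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; the family must exist, the qualifier need not."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Read all cells of a single row (empty dict if the row is absent)."""
        return dict(self.rows.get(row_key, {}))

    def scan(self, start_row=None, stop_row=None):
        """Yield (row_key, cells) in row-key order; stop_row is exclusive."""
        for key in sorted(self.rows):
            if start_row is not None and key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self.rows[key])


contacts = ToyTable(column_families=["personal", "office"])
contacts.put("1000", "personal:name", "Ann Example")
contacts.put("1000", "office:phone", "555-0100")
contacts.put("2000", "personal:name", "Bob Example")
contacts.put("3000", "personal:name", "Cy Example")

row = contacts.get("1000")
scanned_keys = [key for key, _cells in contacts.scan("1000", "3000")]
```

Grouping cells under a `family:qualifier` key is what lets each row carry a different set of columns without a schema change.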
Scenarios: What are the use cases for HBase?
The canonical use case for which BigTable (and by extension, HBase) was created was web search. Search engines
build indexes that map terms to the web pages that contain them. But there are many other use cases that HBase is
suitable for—several of which are itemized in this section.
• Key-value store: HBase can be used as a key-value store, and it is suitable for managing message systems.
Facebook uses HBase for their messaging system, and it is ideal for storing and managing Internet
communications. WebTable uses HBase to search for and manage tables that are extracted from webpages.
• Sensor data: HBase is useful for capturing data that is collected incrementally from various sources. This
includes social analytics, time series, keeping interactive dashboards up-to-date with trends and counters,
and managing audit log systems. Examples include Bloomberg trader terminal and the Open Time Series
Database (OpenTSDB), which stores and provides access to metrics collected about the health of server
systems.
• Real-time query: Phoenix is a SQL query engine for Apache HBase. It is accessed as a JDBC driver, and it
enables querying and managing HBase tables by using SQL.
• HBase as a platform: Applications can run on top of HBase by using it as a datastore. Examples include
Phoenix, OpenTSDB, Kiji, and Titan. Applications can also integrate with HBase. Examples include Hive, Pig,
Solr, Storm, Flume, Impala, Spark, Ganglia, and Drill.
Next steps
• Get started using HBase with Hadoop in HDInsight
• Provision HDInsight clusters on Azure Virtual Network
• Configure HBase replication in HDInsight
• Analyze Twitter sentiment with HBase in HDInsight
• Use Maven to build Java applications that use HBase with HDInsight (Hadoop)
Get started with Apache HBase in HDInsight
Learn how to create HBase tables and query HBase tables by using Hive in HDInsight.
HBase is a low-latency NoSQL database that allows online transactional processing of big data. HBase is offered as a
managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in
Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices. This enables
customers to build interactive websites that work with large datasets, to build services that store sensor and telemetry
data from millions of end points, and to analyze this data with Hadoop jobs. For more information about HBase and the
scenarios it can be used for, see HDInsight HBase overview.
NOTE: HBase (version 0.98.0) is only available with HDInsight 3.1 clusters (based on Apache Hadoop
and YARN 2.4.0). For version information, see What’s new in the Hadoop cluster versions provided by HDInsight?
Prerequisites
Before you begin this tutorial, you must have the following:
• An Azure subscription: For more information about obtaining a subscription, see Purchase Options, Member
Offers, or Free Trial.
• An Azure storage account: For instructions, see How To Create a Storage Account.
• A workstation with Visual Studio 2013 installed: For instructions, see Installing Visual Studio.
Provision an HBase cluster
NOTE:
The steps in this article create an HDInsight cluster by using basic configuration settings. For
information about other cluster configuration settings (such as using an Azure virtual network
or a metastore for Hive and Oozie), see Provision Hadoop clusters in HDInsight by using custom
options.
To provision an HBase cluster by using the Azure portal
1. Sign in to the Azure portal.
2. Click NEW in the lower left, and then click DATA SERVICES > HDINSIGHT > HBASE.
You can also use the CUSTOM CREATE option. (The steps above describe the older classic portal; the steps below
reflect the new portal, which uses the Resource Manager model.)
3. Enter CLUSTER NAME, CLUSTER SIZE, CLUSTER USER PASSWORD, and STORAGE ACCOUNT.
The default HTTP USER NAME is admin. You can customize the name by using the CUSTOM CREATE option.
WARNING:
For high availability of HBase services, you must provision a cluster that contains at least three nodes. This ensures that,
if one node goes down, the HBase data regions are available on other nodes.
4. Click the checkmark icon in the lower right to create the HBase cluster.
NOTE:
After an HBase cluster is deleted, you can create another HBase cluster by using the same default blob. The new cluster
will pick up the HBase tables you created in the original cluster.
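The portal inputs and the high-availability warning above can be written down as a small pre-flight checklist. This is a sketch under stated assumptions, not an Azure API: the class HBaseClusterRequest and its field names are hypothetical and simply mirror the portal labels, and the only rules checked are the ones the text states (the four required values, the default HTTP user name of admin, and the three-node minimum).

```python
from dataclasses import dataclass

MIN_NODES_FOR_HA = 3  # taken from the WARNING in the provisioning steps


@dataclass
class HBaseClusterRequest:
    """Hypothetical container mirroring the portal fields; not an Azure SDK type."""
    cluster_name: str
    cluster_size: int
    user_password: str
    storage_account: str
    http_user_name: str = "admin"  # portal default, per the text

    def problems(self):
        """Return a list of human-readable issues; an empty list means the request looks complete."""
        issues = []
        for field_name in ("cluster_name", "user_password", "storage_account"):
            if not getattr(self, field_name).strip():
                issues.append(f"{field_name} is required")
        if self.cluster_size < MIN_NODES_FOR_HA:
            issues.append(
                f"cluster_size {self.cluster_size} is below the "
                f"{MIN_NODES_FOR_HA} nodes needed for high availability"
            )
        return issues


# Placeholder values, for illustration only.
ok = HBaseClusterRequest("myhbase", 3, "<password>", "mystorage")
too_small = HBaseClusterRequest("myhbase", 1, "<password>", "mystorage")
print(ok.problems())
print(too_small.problems())
```

A check like this catches an undersized cluster before anything is created, which is cheaper than discovering the availability gap after a node fails.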

  • 4. Hadoop Training Resources 1. http://www.youtube.com/playlist?list=PLF82F6499E89E1BAE 2. Someone started a website for the Hadoop Ecosystem. http://hadoopecosystem.whatazoo.com/. http://hadoopecosystem.whatazoo.com/home/training 3. https://www.linkedin.com/redirect?url=http%3A%2F%2Fsatya- hadoop%2Eblogspot%2Ecom%2F2013%2F03%2Fhadoop-training-institutes-in- india%2Ehtml&urlhash=sJuS&_t=tracking_disc 4. http://university.cloudera.com/training/apache_hive_and_pig/hive_and_pig.html 5. http://www.linalis.com/en/training/planning 6. https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4 7. http://cloudwick.com/training/ 8. http://www.learningtree.com/courses/1250/introduction-to-big-data/ 9. www.bisptrainings.com 10. http://www.udemy.com 11. (http://catechnologies.in/big-data.html). 12. http://www.mapr.com/academy/ 13. By the way DatumFora also offers live online instructor lead Hadoop Courses. Check it out athttp://www.datumfora.com/#!online-hadoop-course-oct-26-27/c137j Save 20% when registering with promocode (LNKD20) 14. http://www.datumfora.com/#!2-day-hadoop-class-oct-19-20/cf4u 15. http://www.ambaricloud.com/ 16. http://www.mapr.com/academy/ 17. http://www.datumfora.com/#!upcoming-classes/ct0e 18. http://www.learningtree.com/courses/1250/introduction-to-big-data/ 19. http://cloudwick.com/training/ 20. http://www.linalis.com/en/training/planning http://university.cloudera.com/training/apache_hive_and_pig/hive_and_pig.html 21. http://www.mapr.com/products/download 22. http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support?action=show&redirec t=Distribution 23. http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/ 24. http://hortonworks.com/hdp/downloads/ 25. (Try tutorial on http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/) and read more about Spark GA on HDP (http://hortonworks.com/blog/announcing-apache-spark-now-ga-on- hortonworks-data-platform/) 26. 
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
  • 6. Big Data Lambda Architecture Posted on September 5, 2012 by dbtube In order to meet the challenges of Big Data, you must rethink data systems from the ground up. You will discover that some of the most basic ways people manage data in traditional systems like the relational database management system (RDBMS) is too complex for Big Data systems. The simpler, alternative approach is a new paradigm for Big Data. In this article based on chapter 1, author Nathan Marz shows you this approach he has dubbed the “lambda architecture.” This article is based on Big Data, to be published in Fall 2012. This eBook is available through the Manning Early Access Program (MEAP). Download the eBook instantly from manning.com. All print book purchases include free digital formats (PDF, ePub and Kindle). Visit the book’s page for more information based on Big Data. This content is being reproduced here by permission from Manning Publications. Author: Nathan Marz Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system. The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
  • 7. The 40 data science techniques 1. Linear Regression 2. Logistic Regression 3. Jackknife Regression * 4. Density Estimation 5. Confidence Interval 6. Test of Hypotheses 7. Pattern Recognition 8. Clustering - (aka Unsupervised Learning) 9. Supervised Learning 10. Time Series 11. Decision Trees 12. Random Numbers 13. Monte-Carlo Simulation 14. Bayesian Statistics 15. Naive Bayes 16. Principal Component Analysis - (PCA) 17. Ensembles 18. Neural Networks 19. Support Vector Machine - (SVM) 20. Nearest Neighbors - (k-NN) 21. Feature Selection - (aka Variable Reduction) 22. Indexation / Cataloguing * 23. (Geo-) Spatial Modeling 24. Recommendation Engine * 25. Search Engine * 26. Attribution Modeling * 27. Collaborative Filtering * 28. Rule System 29. Linkage Analysis 30. Association Rules 31. Scoring Engine 32. Segmentation 33. Predictive Modeling 34. Graphs 35. Deep Learning 36. Game Theory 37. Imputation 38. Survival Analysis 39. Arbitrage 40. Lift Modeling 41. Yield Optimization 42. Cross-Validation 43. Model Fitting 44. Relevancy Algorithm * 45. Experimental Design The number of techniques is higher than 40 because we updated the article, and added additional ones.
  • 8. Data Science - DSC Resources From Analytics Bridge  Career: Training | Books | Cheat Sheet | Apprenticeship | Certification | Salary Surveys | Jobs  Knowledge: Research | Competitions | Webinars | Our Book | Members Only | Search DSC  Buzz: Business News | Announcements | Events | RSS Feeds  Misc: Top Links | Code Snippets | External Resources | Best Blogs | Subscribe | For Bloggers Additional Reading  What statisticians think about data scientists  Data Science Compared to 16 Analytic Disciplines  10 types of data scientists  91 job interview questions for data scientists  50 Questions to Test True Data Science Knowledge  24 Uses of Statistical Modeling  21 data science systems used by Amazon to operate its business  Top 20 Big Data Experts to Follow (Includes Scoring Algorithm)  5 Data Science Leaders Share their Predictions for 2016 and Beyond  50 Articles about Hadoop and Related Topics  10 Modern Statistical Concepts Discovered by Data Scientists  Top data science keywords on DSC  4 easy steps to becoming a data scientist  22 tips for better data science  How to detect spurious correlations, and how to find the real ones  17 short tutorials all data scientists should read (and practice)  High versus low-level data science Reference: @DataScienceCtrl | @AnalyticBridge
  • 9. 4 Ways to Spot a Fake Data Scientist
    I’m here to tell you that, from all of my conversations with data scientists and “data scientists,” I’ve discovered four telltale signs that a professional is not a true data scientist:
    1. Lack of a highly quantitative advanced degree – It’s incredibly rare for someone without an advanced quantitative degree to have the technical skills necessary to be a data scientist. In our data science salary report we found that 88% of data scientists have at least a Master’s degree, and 46% have a Ph.D. The areas of study may vary, but the vast majority are very rigorous quantitative, technical, or scientific programs, including Math, Statistics, Computer Science, Engineering, Economics, and Operations Research.
    2. No concrete examples of experience with unstructured data – Lists of tools such as Hadoop, Python, and AWS need to be accompanied by projects that show those skills being put to good use. If a professional cannot provide clear examples of their experience with unstructured data, or mentions data science projects but keeps their involvement very vague, then they are probably not a data scientist. If their specific role in, or impact on, a Big Data project is unclear, that is cause for concern.
    3. Purely academic or research background – This is not to say that someone with a stellar academic or research background won’t make a great data scientist, but a key component of being a data scientist in a corporate setting is business acumen. Understanding how findings affect business goals and delivering actionable insights to leaders is critical to a data scientist’s success. Many research academics have exceptional data skills, but without strong business savvy they are not data scientists… yet.
    4. List of basic business skills – If I see a list of tools on a “data scientist” resume like Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool, you can be sure that I will take a harder look at whether or not this professional makes the grade. These skills are basic business qualifications that are insufficient for most data science positions, and by themselves are not indicative of a true data scientist.
    Unstructured Data Definition
    Unstructured data (or unstructured information) refers to information that either does not have a predefined data model or is not organized in a predefined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
    Resources
    1. Advanced Degree – More Data Science programs are popping up to serve the current demand, but there are also many Mathematics, Statistics, and Computer Science programs.
    2. MOOCs – Coursera, Udacity, and Codecademy are good places to start.
    3. Certifications – KDnuggets has compiled an extensive list.
    4. Bootcamps – For more information about how this approach compares to degree programs or MOOCs, check out this guest blog from the data scientists at Datascope Analytics.
    5. Kaggle – Kaggle hosts data science competitions where you can practice, hone your skills with messy, real-world data, and tackle actual business problems. Employers take Kaggle rankings seriously, as they can be seen as relevant, hands-on project work.
    6. LinkedIn Groups – Join relevant groups to interact with other members of the data science community.
    7. Data Science Central and KDnuggets – Both are good resources for staying at the forefront of industry trends in data science.
    8. The Burtch Works Study: Salaries of Data Scientists – If you’re looking for more information about the salaries and demographics of current data scientists, be sure to download our data scientist salary study.
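    To make the Unstructured Data Definition above a little more concrete, here is a minimal sketch of pulling dates and numbers out of a text-heavy record. The sample note, the pattern names, and the `extract_facts` helper are all invented for illustration; real unstructured data needs far more robust parsing than two regular expressions.

```python
import re

# Hypothetical free-text note; the wording is invented for illustration.
note = "Customer called on 2016-12-30 about 3 late orders and a refund of 42.50 dollars."

# ISO-style dates only (YYYY-MM-DD); other date formats are deliberately ignored.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Plain integers or decimals.
NUMBER_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\b")


def extract_facts(text):
    """Return the dates and the remaining numbers found in a piece of free text."""
    dates = DATE_PATTERN.findall(text)
    # Blank out the dates first so their digits are not counted again as numbers.
    remainder = DATE_PATTERN.sub(" ", text)
    numbers = [float(n) for n in NUMBER_PATTERN.findall(remainder)]
    return {"dates": dates, "numbers": numbers}


facts = extract_facts(note)
print(facts)
```

    The point of the sketch is that the structure (which token is a date, which is an amount) has to be recovered from the text itself, because no predefined data model supplies it.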
  • 10. You’re Not a Data Scientist
    The IT biz has historically rebranded job titles based upon what’s trending: today’s Software Architects were once known as Designers or Systems Engineers. Nothing is trending faster and louder than predictive analytics, machine learning, deep learning and AI. So it’s our turn to rebrand data geeks as data scientists. Now don’t get me wrong: some of these folks are legit Data Scientists, but the majority are not. I guess I’m a purist – calling yourself a scientist indicates that you practice science following a scientific method. You create a hypothesis, test it against experimental results and, after proving or disproving the conjecture, move on or iterate.
    Skills needed to be a Data Scientist
    Technical Skills: Analytics
    1. Education – Data scientists are highly educated – 88% have at least a Master’s degree and 46% have PhDs – and while there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. Their most common fields of study are Mathematics and Statistics (32%), followed by Computer Science (19%) and Engineering (16%).
    2. SAS and/or R – In-depth knowledge of at least one of these analytical tools; for data science, R is generally preferred.
    Technical Skills: Computer Science
    3. Python Coding – Python is the most common coding language I typically see required in data science roles, along with Java, Perl, or C/C++.
    4. Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can also be beneficial.
    5. SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science, it is still expected that a candidate will be able to write and execute complex queries in SQL.
    6. Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is from social media, video feeds or audio.
    Non-Technical Skills
    7. Intellectual curiosity – No doubt you’ve seen this phrase everywhere lately, especially as it relates to data scientists. Frank Lo describes what it means, and talks about other necessary “soft skills,” in his guest blog posted a few months ago.
    8. Business acumen – To be a data scientist you’ll need a solid understanding of the industry you’re working in, and know what business problems your company is trying to solve. In terms of data science, being able to discern which problems are important to solve for the business is critical, in addition to identifying new ways the business should be leveraging its data.
    9. Communication skills – Companies searching for a strong data scientist are looking for someone who can clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or Sales departments. A data scientist must enable the business to make decisions by arming it with quantified insights, in addition to understanding the needs of non-technical colleagues in order to wrangle the data appropriately. Check out our recent flash survey for more information on communication skills for quantitative professionals.
  • 11. My Data Science profile which you might want to use in your resume
  • 12. Microsoft Big Data Market Play – HDInsight
    I highly recommend HDInsight for Windows developers who do not work on Linux. Machine Learning on Azure abstracts away much of the Big Data complexity and lets you jump straight to the final analysis stages; for example, work that takes 6-7 steps in Hadoop can take 2 steps in HDInsight.
    HDInsight on Linux (Preview)
    - Get started with HDInsight on Linux - A quick-start tutorial for provisioning HDInsight Hadoop clusters on Linux and running sample Hive queries.
    - Provision HDInsight on Linux using custom options - Learn how to provision an HDInsight Hadoop cluster on Linux by using custom options through the Azure Management Portal, Azure cross-platform command line, or Azure
    - Working with HDInsight on Linux - Get some quick tips on working with Hadoop Linux clusters provisioned on Azure.
    - Manage HDInsight clusters using Ambari - Learn how to monitor and manage your Linux-based Hadoop on HDInsight cluster by using Ambari Web or the Ambari REST API.
    HDInsight on Windows
    - HDInsight documentation - The documentation page for Azure HDInsight with links to articles, videos, and more resources.
    - Learning map for HDInsight - A guided tour of Hadoop documentation for HDInsight.
    - Get started with Azure HDInsight - A quick-start tutorial for using Hadoop in HDInsight.
    - Run the HDInsight samples - A tutorial on how to run the samples that ship with HDInsight.
    - Azure HDInsight SDK - Reference documentation for the HDInsight SDK.
    Apache Hadoop
    - Apache Hadoop - Learn more about the Apache Hadoop software library, a framework that allows for the distributed processing of large datasets across clusters of computers.
    - HDFS - Learn more about the architecture and design of the Hadoop Distributed File System, the primary storage system used by Hadoop applications.
    - MapReduce Tutorial - Learn more about the programming framework for writing Hadoop applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
    SQL Database on Azure
    - Azure SQL Database - MSDN documentation for SQL Database.
    - Management Portal for SQL Database - A lightweight and easy-to-use database management tool for managing SQL Database in the cloud.
    - Adventure Works for SQL Database - Download page for a SQL Database sample database.
  • 13. Microsoft Business Intelligence (for HDInsight on Windows)
    Familiar business intelligence (BI) tools - such as Excel, PowerPivot, SQL Server Analysis Services, and SQL Server Reporting Services - retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver. These BI tools can help in your big-data analysis:
    Connect Excel to Hadoop with Power Query
    - Learn how to connect Excel to the Azure Storage account that stores the data associated with your HDInsight cluster by using Microsoft Power Query for Excel.
    Connect Excel to Hadoop with the Microsoft Hive ODBC Driver
    - Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver.
    Microsoft Cloud Platform
    - Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL Server BI.
    - Learn more about SQL Server Analysis Services.
    - Learn about SQL Server Reporting Services.
    Try HDInsight solutions for big-data analysis (for HDInsight on Windows)
    Analyze data from your organization to gain insights into your business. Here are some examples:
    Analyze HVAC sensor data
    Learn how to analyze sensor data by using Hive with HDInsight (Hadoop), and then visualize the data in Microsoft Excel. In this sample, you'll use Hive to process historical data produced by HVAC systems to see which systems can't reliably maintain a set temperature.
    Use Hive with HDInsight to analyze website logs
    Learn how to use HiveQL in HDInsight to analyze website logs to get insight into the frequency of visits in a day from external websites, and a summary of website errors that the users experience.
    Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop)
    Learn how to build a solution that uses a Storm cluster in HDInsight to process sensor data from Azure Event Hubs, and then displays the processed sensor data as near-real-time information on a web-based dashboard.
    To try Hadoop on HDInsight, see the "Get started" articles in the Explore section on the HDInsight documentation page. To try more advanced examples, scroll down to the Analyze section.
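    The HVAC example above asks which systems can't reliably maintain a set temperature. The HDInsight sample answers that with Hive; the sketch below only illustrates the underlying logic in plain Python. The readings, the system IDs, the 5-degree tolerance, and the 25% miss-rate cut-off are all assumptions made up for this illustration, not values taken from the sample.

```python
from collections import defaultdict

# Hypothetical readings: (system_id, target_temp, actual_temp).
readings = [
    ("sys-1", 70, 71),
    ("sys-1", 70, 69),
    ("sys-2", 68, 75),
    ("sys-2", 68, 74),
    ("sys-3", 72, 66),
    ("sys-3", 72, 72),
]

TOLERANCE = 5  # assumed: degrees of deviation from the target that still count as "on target"


def unreliable_systems(rows, tolerance=TOLERANCE, max_miss_rate=0.25):
    """Return IDs of systems whose share of off-target readings exceeds max_miss_rate."""
    total = defaultdict(int)
    misses = defaultdict(int)
    for system_id, target, actual in rows:
        total[system_id] += 1
        if abs(actual - target) > tolerance:
            misses[system_id] += 1
    return sorted(s for s in total if misses[s] / total[s] > max_miss_rate)


flagged = unreliable_systems(readings)
print(flagged)
```

    In Hive the same idea would typically be a grouped aggregate over the sensor table; the Python version is only meant to show the per-system counting that such a query performs.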
  • 14. HDInsight HBase overview (MSDN)
    HBase is an open-source Apache NoSQL database that is built on Hadoop. HBase provides random access and strong consistency for large amounts of unstructured and semistructured data. It was modeled on Google's BigTable, and it is a column-family-oriented database. Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schema-less database in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
    What is HDInsight HBase in Azure?
    HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices. This enables customers to build interactive websites that work with large datasets, to build services that store sensor and telemetry data from millions of end points, and to analyze this data with Hadoop jobs. HBase and Hadoop are good starting points for big data projects in Azure; in particular, they can enable real-time applications to work with large datasets.
    The HDInsight implementation leverages the scale-out architecture of HBase to provide automatic sharding of tables, strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory caching for reads and high-throughput streaming for writes. Virtual network provisioning is also available for HDInsight HBase. For details, see Provision HDInsight clusters on Azure Virtual Network.
    How is data managed in HDInsight HBase?
    Data can be managed in HBase by using the create, get, put, and scan commands from the HBase shell. Data is written to the database by using put and read by using get. The scan command is used to obtain data from multiple rows in a table. Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API. An HBase database can also be queried by using Hive. For an introduction to these programming models, see Get started using HBase with Hadoop in HDInsight. Co-processors are also available, which allow data processing in the nodes that host the database.
    Scenarios: What are the use cases for HBase?
    The canonical use case for which BigTable (and by extension, HBase) was created was web search. Search engines build indexes that map terms to the web pages that contain them. But there are many other use cases that HBase is suitable for, several of which are itemized in this section.
    - Key-value store: HBase can be used as a key-value store, and it is suitable for managing message systems. Facebook uses HBase for their messaging system, and it is ideal for storing and managing Internet communications. WebTable uses HBase to search for and manage tables that are extracted from webpages.
    - Sensor data: HBase is useful for capturing data that is collected incrementally from various sources. This includes social analytics, time series, keeping interactive dashboards up-to-date with trends and counters, and managing audit log systems. Examples include Bloomberg trader terminal and the Open Time Series Database (OpenTSDB), which stores and provides access to metrics collected about the health of server systems.
    - Real-time query: Phoenix is a SQL query engine for Apache HBase. It is accessed as a JDBC driver, and it enables querying and managing HBase tables by using SQL.
    - HBase as a platform: Applications can run on top of HBase by using it as a datastore. Examples include Phoenix, OpenTSDB, Kiji, and Titan. Applications can also integrate with HBase. Examples include Hive, Pig, Solr, Storm, Flume, Impala, Spark, Ganglia, and Drill.
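    The data model and the put, get, and scan operations described under "How is data managed in HDInsight HBase?" can be sketched with a small in-memory stand-in. This is a toy model only: `ToyTable`, its method signatures, and the sample rows are invented for illustration and are not the HBase shell, the C# API, or any real client library. It assumes column families are declared when the table is created while individual columns appear on first write, and that a scan walks rows in sorted key order from an inclusive start key to an exclusive stop key.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase-style table; not the HBase client API."""

    def __init__(self, name, column_families):
        self.name = name
        # Column families are declared up front (assumption of this toy model)...
        self.column_families = set(column_families)
        # ...but columns inside a family are created on first write.
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; `column` is written as 'family:qualifier'."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key):
        """Read all cells of a single row (empty dict if the row does not exist)."""
        return dict(self.rows.get(row_key, {}))

    def scan(self, start_row=None, stop_row=None):
        """Yield (row_key, cells) in key order, start inclusive, stop exclusive."""
        for key in sorted(self.rows):
            if start_row is not None and key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self.rows[key])


# Hypothetical sensor table with two column families.
table = ToyTable("sensors", ["reading", "meta"])
table.put("device-001", "reading:temp", "21.5")
table.put("device-001", "meta:site", "lab")
table.put("device-002", "reading:temp", "19.0")
table.put("device-003", "reading:humidity", "40")  # new column, no schema change needed

row = table.get("device-001")
window = [key for key, _ in table.scan("device-002", "device-004")]
print(row)
print(window)
```

    Note how `reading:humidity` is written without any prior declaration, which mirrors the "schema-less" property described above, and how the range scan depends only on the ordering of row keys.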
    Next steps
    - Get started using HBase with Hadoop in HDInsight
    - Provision HDInsight clusters on Azure Virtual Network
    - Configure HBase replication in HDInsight
    - Analyze Twitter sentiment with HBase in HDInsight
    - Use Maven to build Java applications that use HBase with HDInsight (Hadoop)
  • 15. Get started with Apache HBase in HDInsight
    Learn how to create HBase tables and query HBase tables by using Hive in HDInsight. HBase is a low-latency NoSQL database that allows online transactional processing of big data. HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices. This enables customers to build interactive websites that work with large datasets, to build services that store sensor and telemetry data from millions of end points, and to analyze this data with Hadoop jobs. For more information about HBase and the scenarios it can be used for, see HDInsight HBase overview.
    NOTE: HBase (version 0.98.0) is only available for use with HDInsight 3.1 clusters on HDInsight (based on Apache Hadoop and YARN 2.4.0). For version information, see What's new in the Hadoop cluster versions provided by HDInsight?
    Prerequisites
    Before you begin this tutorial, you must have the following:
    - An Azure subscription: For more information about obtaining a subscription, see Purchase Options, Member Offers, or Free Trial.
    - An Azure storage account: For instructions, see How To Create a Storage Account.
    - A workstation with Visual Studio 2013 installed: For instructions, see Installing Visual Studio.
    Provision an HBase cluster
    NOTE: The steps in this article create an HDInsight cluster by using basic configuration settings. For information about other cluster configuration settings (such as using Azure virtual network or a metastore for Hive and Oozie), see Provision Hadoop clusters in HDInsight by using custom options.
    To provision an HBase cluster by using the Azure portal
    1. Sign in to the Azure portal.
    2. Click NEW in the lower left, and then click DATA SERVICES > HDINSIGHT > HBASE. You can also use the CUSTOM CREATE option.
    (The above is the older classic portal; the below is the new portal, which uses the Resource Manager construct.)
    3. Enter CLUSTER NAME, CLUSTER SIZE, CLUSTER USER PASSWORD, and STORAGE ACCOUNT.
  • 16. The default HTTP USER NAME is admin. You can customize the name by using the CUSTOM CREATE option.
    WARNING: For high availability of HBase services, you must provision a cluster that contains at least three nodes. This ensures that, if one node goes down, the HBase data regions are available on other nodes.
    4. Click the checkmark icon in the lower right to create the HBase cluster.
    NOTE: After an HBase cluster is deleted, you can create another HBase cluster by using the same default blob. The new cluster will pick up the HBase tables you created in the original cluster.