An aggregation of Big Data training resources: different paths to gaining experience, research methods, virtual machines, vendor comparisons, and various other resources.
This is the output of a project I delivered last year and a must-read for anyone who is trying to break into the Big Data space.
I believe it will save readers the time it would take to gather these resources manually.
Document by Bash Badawi, December 30, 2016
Please feel free to share; however, I kindly ask that you reference the source. Email me if you need further documentation or have questions or suggestions. Twitter: @bashbadawi, LinkedIn Profile, My 4-part Big Data Articles on LinkedIn comparing Vendors, Stacks, etc., and Blog on WordPress. Some of the content is lifted from various sources, all of them verifiable data scientists. Unfortunately, I do not have the references to include in this document. If you are a content provider I used, please email me so that I can include you in the document.
Use the Table of Contents to easily navigate to the desired resources.
About Me: A Computer Science/Math graduate with a recent Master’s degree in Business/Software Economics and a veteran of the IT industry of over 20 years.
Contents
Hadoop Training Resources
Machine Learning Resources
Big Data Lambda Architecture
The 40 data science techniques
Data Science - DSC Resources From Analytics Bridge
Additional Reading
4 Ways to Spot a Fake Data Scientist
Unstructured Data Definition
Resources
You’re Not a Data Scientist
Skills needed to be a Data Scientist
Technical Skills: Analytics
Technical Skills: Computer Science
Non-Technical Skills
My Data Science profile which you might want to use in your resume
Microsoft Big Data Market Play – HDInsight
HDInsight on Linux (Preview)
HDInsight on Windows
Apache Hadoop
Apache Hadoop - Learn more about the Apache Hadoop software library, a framework that allows for the distributed processing of large datasets across clusters of computers
HDFS - Learn more about the architecture and design of the Hadoop Distributed File System, the primary storage system used by Hadoop applications
MapReduce Tutorial - Learn more about the programming framework for writing Hadoop applications that rapidly process large amounts of data in parallel on large clusters of compute nodes
SQL Database on Azure
Azure SQL Database - MSDN documentation for SQL Database
Management Portal for SQL Database - A lightweight and easy-to-use database management tool for managing SQL Database in the cloud
Adventure Works for SQL Database - Download page for a SQL Database sample database
Microsoft Business Intelligence (for HDInsight on Windows)
Connect Excel to Hadoop with Power Query
Connect Excel to Hadoop with the Microsoft Hive ODBC Driver
Microsoft Cloud Platform
Learn about SQL Server Reporting Services
Try HDInsight solutions for big-data analysis (for HDInsight on Windows)
Analyze HVAC sensor data
Use Hive with HDInsight to analyze website logs
Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop)
HDInsight HBase overview MSDN
What is HDInsight HBase in Azure?
How is data managed in HDInsight HBase?
Scenarios: What are the use cases for HBase?
Next steps
Get started with Apache HBase in HDInsight
Learn how to create HBase tables and query HBase tables by using Hive in HDInsight
NOTE: HBase (version 0.98.0) is only available for use with HDInsight 3.1 clusters on HDInsight (based on Apache Hadoop and YARN 2.4.0). For version information, see what’s new in the Hadoop cluster versions provided by HDInsight?
Prerequisites
Provision an HBase cluster
To provision an HBase cluster by using the Azure portal
NOTE:
Hadoop Training Resources
1. http://www.youtube.com/playlist?list=PLF82F6499E89E1BAE
2. Someone started a website for the Hadoop Ecosystem: http://hadoopecosystem.whatazoo.com/ (training page: http://hadoopecosystem.whatazoo.com/home/training)
3. https://www.linkedin.com/redirect?url=http%3A%2F%2Fsatya-hadoop%2Eblogspot%2Ecom%2F2013%2F03%2Fhadoop-training-institutes-in-india%2Ehtml&urlhash=sJuS&_t=tracking_disc
4. http://university.cloudera.com/training/apache_hive_and_pig/hive_and_pig.html
5. http://www.linalis.com/en/training/planning
6. https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
7. http://cloudwick.com/training/
8. http://www.learningtree.com/courses/1250/introduction-to-big-data/
9. www.bisptrainings.com
10. http://www.udemy.com
11. http://catechnologies.in/big-data.html
12. http://www.mapr.com/academy/
13. DatumFora also offers live, online, instructor-led Hadoop courses. Check it out at http://www.datumfora.com/#!online-hadoop-course-oct-26-27/c137j (save 20% when registering with promo code LNKD20)
14. http://www.datumfora.com/#!2-day-hadoop-class-oct-19-20/cf4u
15. http://www.ambaricloud.com/
16. http://www.datumfora.com/#!upcoming-classes/ct0e
17. http://www.mapr.com/products/download
18. http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support?action=show&redirect=Distribution
19. http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/
20. http://hortonworks.com/hdp/downloads/
21. Try the tutorial at http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ and read more about Spark GA on HDP at http://hortonworks.com/blog/announcing-apache-spark-now-ga-on-hortonworks-data-platform/
22. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
6. Big Data Lambda Architecture
Posted on September 5, 2012 by dbtube
In order to meet the challenges of Big Data, you must rethink data systems from the ground up. You will discover that
some of the most basic ways people manage data in traditional systems like the relational database management
system (RDBMS) are too complex for Big Data systems. The simpler, alternative approach is a new paradigm for Big Data.
In this article, based on chapter 1, author Nathan Marz presents the approach he has dubbed the “lambda
architecture.”
This article is based on Big Data, to be published in Fall 2012. This eBook is available through the Manning Early Access
Program (MEAP). Download the eBook instantly from manning.com. All print book purchases include free digital formats
(PDF, ePub and Kindle). Visit the book’s page for more information on Big Data. This content is being reproduced
here by permission from Manning Publications.
Author: Nathan Marz
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that
provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data
system.
The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by
decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
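The three layers can be illustrated with a toy page-view counter. This is only a minimal sketch of one common reading of the lambda architecture, not code from the book: the event format, the function names, and the choice of counting page views are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical data: each event is the URL of one page view.
master_dataset = ["/home", "/about", "/home"]  # full, append-only history
recent_events = ["/home", "/pricing"]          # arrived after the last batch run


def batch_layer(events):
    """Recompute the view from scratch over the whole master dataset."""
    return Counter(events)


def serving_layer(batch_view):
    """Expose the precomputed batch view for fast, read-only lookups."""
    return dict(batch_view)


def speed_layer(events):
    """Cover only the recent data the batch view has not absorbed yet."""
    return Counter(events)


def query(serving_view, realtime_view, key):
    """Answer a query by merging the batch result with the realtime result."""
    return serving_view.get(key, 0) + realtime_view.get(key, 0)


serving_view = serving_layer(batch_layer(master_dataset))
realtime_view = speed_layer(recent_events)

print(query(serving_view, realtime_view, "/home"))     # 3
print(query(serving_view, realtime_view, "/pricing"))  # 1
```

In this sketch the batch layer is slow but simple because it always recomputes from the full history, while the speed layer only has to compensate for the most recent events.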
7. The 40 data science techniques
1. Linear Regression
2. Logistic Regression
3. Jackknife Regression *
4. Density Estimation
5. Confidence Interval
6. Test of Hypotheses
7. Pattern Recognition
8. Clustering - (aka Unsupervised Learning)
9. Supervised Learning
10. Time Series
11. Decision Trees
12. Random Numbers
13. Monte-Carlo Simulation
14. Bayesian Statistics
15. Naive Bayes
16. Principal Component Analysis - (PCA)
17. Ensembles
18. Neural Networks
19. Support Vector Machine - (SVM)
20. Nearest Neighbors - (k-NN)
21. Feature Selection - (aka Variable Reduction)
22. Indexation / Cataloguing *
23. (Geo-) Spatial Modeling
24. Recommendation Engine *
25. Search Engine *
26. Attribution Modeling *
27. Collaborative Filtering *
28. Rule System
29. Linkage Analysis
30. Association Rules
31. Scoring Engine
32. Segmentation
33. Predictive Modeling
34. Graphs
35. Deep Learning
36. Game Theory
37. Imputation
38. Survival Analysis
39. Arbitrage
40. Lift Modeling
41. Yield Optimization
42. Cross-Validation
43. Model Fitting
44. Relevancy Algorithm *
45. Experimental Design
The list contains more than 40 techniques because the article was updated with additional ones.
8. Data Science - DSC Resources From Analytics Bridge
Career: Training | Books | Cheat Sheet | Apprenticeship | Certification | Salary Surveys | Jobs
Knowledge: Research | Competitions | Webinars | Our Book | Members Only | Search DSC
Buzz: Business News | Announcements | Events | RSS Feeds
Misc: Top Links | Code Snippets | External Resources | Best Blogs | Subscribe | For Bloggers
Additional Reading
What statisticians think about data scientists
Data Science Compared to 16 Analytic Disciplines
10 types of data scientists
91 job interview questions for data scientists
50 Questions to Test True Data Science Knowledge
24 Uses of Statistical Modeling
21 data science systems used by Amazon to operate its business
Top 20 Big Data Experts to Follow (Includes Scoring Algorithm)
5 Data Science Leaders Share their Predictions for 2016 and Beyond
50 Articles about Hadoop and Related Topics
10 Modern Statistical Concepts Discovered by Data Scientists
Top data science keywords on DSC
4 easy steps to becoming a data scientist
22 tips for better data science
How to detect spurious correlations, and how to find the real ones
17 short tutorials all data scientists should read (and practice)
High versus low-level data science
Reference: @DataScienceCtrl | @AnalyticBridge
9. 4 Ways to Spot a Fake Data Scientist
I’m here to tell you that from all of my conversations with data scientists and “data scientists” I’ve discovered four
telltale signs that a professional is not a true data scientist:
1. Lack of a highly quantitative advanced degree – It’s incredibly rare for someone without an advanced
quantitative degree to have the technical skills necessary to be a data scientist. In our data science salary
report we found that 88% of data scientists have at least a Master’s degree, and 46% have a Ph.D. The areas
of study may vary, but the vast majority are very rigorous quantitative, technical, or scientific programs,
including Math, Statistics, Computer Science, Engineering, Economics, and Operations Research.
2. No concrete examples of experience with unstructured data – Lists of tools such as Hadoop, Python, and AWS
need to be accompanied by projects that show those skills being put to good use. If a professional cannot
provide clear examples of their experience with unstructured data, or mentions data science projects, but
keeps their involvement very vague, then they are probably not a data scientist. If their specific role in or impact
on a Big Data project is unclear, that is cause for concern.
3. Purely academic or research background – Now, this is not to say that someone with a stellar academic or
research background won’t make a great data scientist, but a key component to being a data scientist in a
corporate setting is business acumen. Understanding how findings affect business goals and delivering
actionable insights to leaders is critical to a data scientist’s success. Many research academics have exceptional
data skills, but without strong business savvy they are not data scientists… yet.
4. List of basic business skills – If I see a list of tools on a “data scientist” resume like Omniture, Google Analytics,
SPSS, Excel, or any other Microsoft Office tool, you can be sure that I will take a harder look at whether or not
this professional makes the grade. These skills are basic business qualifications that are insufficient for most data
science positions, and by themselves are not indicative of a true data scientist.
Unstructured Data Definition
Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined
data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but
may contain data such as dates, numbers, and facts as well.
Resources
1. Advanced Degree – More Data Science programs are popping up to serve the current demand, but there
are also many Mathematics, Statistics, and Computer Science programs.
2. MOOCs – Coursera, Udacity, and Codecademy are good places to start.
3. Certifications – KDnuggets has compiled an extensive list.
4. Bootcamps – For more information about how this approach compares to degree programs or MOOCs, check
out this guest blog from the data scientists at Datascope Analytics.
5. Kaggle – Kaggle hosts data science competitions where you can practice, hone your skills with messy, real-world
data, and tackle actual business problems. Employers take Kaggle rankings seriously, as they can be seen
as relevant, hands-on project work.
6. LinkedIn Groups – Join relevant groups to interact with other members of the data science community.
7. Data Science Central and KDnuggets – Data Science Central and KDnuggets are good resources for staying
at the forefront of industry trends in data science.
8. The Burtch Works Study: Salaries of Data Scientists – If you’re looking for more information about the salaries
and demographics of current data scientists be sure to download our data scientist salary study.
10. You’re Not a Data Scientist
The IT biz has historically rebranded job titles based upon what’s trending: today’s Software Architects were once
known as Designers or Systems Engineers. Nothing is trending faster and louder than predictive analytics, machine
learning, deep learning and AI. So it’s our turn to rebrand data geeks as data scientists. Now don’t get me wrong: some
of these folks are legit Data Scientists, but the majority are not. I guess I’m a purist. Calling yourself a scientist indicates
that you practice science following a scientific method: you create hypotheses, test them against experimental
results and, after proving or disproving the conjecture, move on or iterate.
Skills needed to be a Data Scientist
Technical Skills: Analytics
1. Education – Data scientists are highly educated – 88% have at least a Master’s degree and 46% have PhDs – and
while there are notable exceptions, a very strong educational background is usually required to develop the
depth of knowledge necessary to be a data scientist. Their most common fields of study are Mathematics and
Statistics (32%), followed by Computer Science (19%) and Engineering (16%).
2. SAS and/or R – In-depth knowledge of at least one of these analytical tools; for data science, R is generally
preferred.
Technical Skills: Computer Science
3. Python Coding – Python is the most common coding language I typically see required in data science roles, along
with Java, Perl, or C/C++.
4. Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having
experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can
also be beneficial.
5. SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science, it is
still expected that a candidate will be able to write and execute complex queries in SQL.
6. Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is from
social media, video feeds or audio.
Non-Technical Skills
7. Intellectual curiosity – No doubt you’ve seen this phrase everywhere lately, especially as it relates to data
scientists. Frank Lo describes what it means, and talks about other necessary “soft skills” in his guest
blog posted a few months ago.
8. Business acumen – To be a data scientist you’ll need a solid understanding of the industry you’re working in,
and know what business problems your company is trying to solve. In terms of data science, being able to
discern which problems are important to solve for the business is critical, in addition to identifying new ways the
business should be leveraging its data.
9. Communication skills – Companies searching for a strong data scientist are looking for someone who can clearly
and fluently translate their technical findings to a non-technical team, such as the Marketing or Sales
departments. A data scientist must enable the business to make decisions by arming them with quantified
insights, in addition to understanding the needs of their non-technical colleagues in order to wrangle the data
appropriately. Check out our recent flash survey for more information on communication skills for
quantitative professionals.
11. My Data Science profile which you might want to use in your resume
12. Microsoft Big Data Market Play – HDInsight
I highly recommend HDInsight for Windows developers who do not work on Linux.
Machine Learning on Azure abstracts away a lot of the Big Data
complexity and lets you jump straight to the final analysis levels, i.e. what takes
6-7 steps in Hadoop takes 2 steps in HDInsight.
HDInsight on Linux (Preview)
Get started with HDInsight on Linux - A quick-start tutorial for provisioning HDInsight Hadoop clusters on
Linux and running sample Hive queries.
Provision HDInsight on Linux using custom options - Learn how to provision an HDInsight Hadoop cluster on
Linux by using custom options through the Azure Management Portal, Azure cross-platform command line,
or Azure
Working with HDInsight on Linux - Get some quick tips on working with Hadoop Linux clusters provisioned
on Azure.
Manage HDInsight clusters using Ambari - Learn how to monitor and manage your Linux-based Hadoop on
HDInsight cluster by using Ambari Web, or the Ambari REST API.
HDInsight on Windows
HDInsight documentation - The documentation page for Azure HDInsight with links to articles, videos, and
more resources.
Learning map for HDInsight - A guided tour of Hadoop documentation for HDInsight.
Get started with Azure HDInsight - A quick-start tutorial for using Hadoop in HDInsight.
Run the HDInsight samples - A tutorial on how to run the samples that ship with HDInsight.
Azure HDInsight SDK - Reference documentation for the HDInsight SDK.
Apache Hadoop
Apache Hadoop - Learn more about the Apache Hadoop software library, a framework that allows for the
distributed processing of large datasets across clusters of computers.
HDFS - Learn more about the architecture and design of the Hadoop Distributed File System, the primary
storage system used by Hadoop applications.
MapReduce Tutorial - Learn more about the programming framework for writing Hadoop applications
that rapidly process large amounts of data in parallel on large clusters of compute nodes.
SQL Database on Azure
Azure SQL Database - MSDN documentation for SQL Database.
Management Portal for SQL Database - A lightweight and easy-to-use database management tool for managing
SQL Database in the cloud.
Adventure Works for SQL Database - Download page for a SQL Database sample database.
13. Microsoft Business Intelligence (for HDInsight on Windows)
Familiar business intelligence (BI) tools - such as Excel, PowerPivot, SQL Server Analysis Services, and SQL Server
Reporting Services - retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in
or the Microsoft Hive ODBC Driver.
These BI tools can help in your big-data analysis:
Connect Excel to Hadoop with Power Query
Learn how to connect Excel to the Azure Storage account that stores the data associated with your HDInsight
cluster by using Microsoft Power Query for Excel.
Connect Excel to Hadoop with the Microsoft Hive ODBC Driver
Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver.
Microsoft Cloud Platform
Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL
Server BI.
Learn more about SQL Server Analysis Services.
Learn about SQL Server Reporting Services.
Try HDInsight solutions for big-data analysis (for HDInsight on Windows)
Analyze data from your organization to gain insights into your business. Here are some examples:
Analyze HVAC sensor data
Learn how to analyze sensor data by using Hive with HDInsight (Hadoop), and then visualize the data in Microsoft Excel.
In this sample, you'll use Hive to process historical data produced by HVAC systems to see which systems can't reliably
maintain a set temperature.
Use Hive with HDInsight to analyze website logs
Learn how to use HiveQL in HDInsight to analyze website logs to get insight into the frequency of visits in a day from
external websites, and a summary of website errors that the users experience.
Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop)
Learn how to build a solution that uses a Storm cluster in HDInsight to process sensor data from Azure Event Hubs, and
then displays the processed sensor data as near-real-time information on a web-based dashboard.
To try Hadoop on HDInsight, see "Get started" articles in the Explore section on the HDInsight documentation page. To
try more advanced examples, scroll down to the Analyze section.
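As a rough illustration of the HVAC scenario described above, the sketch below flags systems whose measured temperature strays from the set temperature. The tutorial itself uses Hive; this Python version only mirrors the idea. The record layout, the system names, and the 3-degree tolerance are assumptions made for the example, not values from the sample dataset.

```python
from collections import defaultdict

# Hypothetical readings: (system_id, target_temp, actual_temp).
readings = [
    ("hvac-1", 70, 71),
    ("hvac-1", 70, 69),
    ("hvac-2", 68, 75),
    ("hvac-2", 68, 61),
]
TOLERANCE = 3  # assumed acceptable deviation, in degrees


def unreliable_systems(readings, tolerance=TOLERANCE):
    """Return {system_id: share of readings outside the tolerance band}."""
    misses = defaultdict(int)
    totals = defaultdict(int)
    for system_id, target, actual in readings:
        totals[system_id] += 1
        if abs(actual - target) > tolerance:
            misses[system_id] += 1
    # Keep only systems that missed the target at least once.
    return {s: misses[s] / totals[s] for s in totals if misses[s] > 0}


print(unreliable_systems(readings))  # {'hvac-2': 1.0}
```

The same grouping and filtering could be expressed as a GROUP BY query in Hive once the readings are loaded into a table.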
14. HDInsight HBase overview MSDN
HBase is an Apache, open-source, NoSQL database that is built on Hadoop. HBase provides random access and strong
consistency for large amounts of unstructured and semistructured data. It was modeled on Google's BigTable, and it is
a column-family-oriented database. Data is stored in the rows of a table, and data within a row is grouped by column
family. HBase is a schema-less database in the sense that neither the columns nor the type of data stored in them need
to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of
nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications
in the Hadoop ecosystem.
What is HDInsight HBase in Azure?
HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are
configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in
performance and cost choices. This enables customers to build interactive websites that work with large datasets, to
build services that store sensor and telemetry data from millions of end points, and to analyze this data with Hadoop
jobs. HBase and Hadoop are good starting points for big data projects in Azure; in particular, they can enable real-time
applications to work with large datasets.
The HDInsight implementation leverages the scale-out architecture of HBase to provide automatic sharding of tables,
strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory caching for
reads and high-throughput streaming for writes. Virtual network provisioning is also available for HDInsight HBase. For
details, see Provision HDInsight clusters on Azure Virtual Network.
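The sharding mentioned above rests on range partitioning: HBase splits a table into regions, each covering a contiguous range of row keys. The sketch below shows only the lookup step, under assumed split points and node names; it does not model how HBase chooses split points, rebalances regions, or fails over.

```python
from bisect import bisect_right

# Hypothetical split points: three boundaries give four key ranges,
# [.. g), [g .. n), [n .. t), [t ..), one per region.
split_keys = ["g", "n", "t"]
region_servers = ["node-0", "node-1", "node-2", "node-3"]  # assumed names


def region_for(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(split_keys, row_key)


def server_for(row_key):
    """Return the (hypothetical) node that serves the region for row_key."""
    return region_servers[region_for(row_key)]


print(server_for("apple"))   # node-0
print(server_for("monkey"))  # node-1
print(server_for("zebra"))   # node-3
```

Because row keys are kept in sorted order, a client can locate the right region with a binary search over the boundaries instead of asking every node.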
How is data managed in HDInsight HBase?
Data can be managed in HBase by using the Create, Get, Put, and Scan commands from the HBase shell. Data is written
to the database by using put and read by using get. The scan command is used to obtain data from multiple rows in a
table. Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.
An HBase database can also be queried by using Hive. For an introduction to these programming models, see Get
started using HBase with Hadoop in HDInsight. Co-processors are also available, which allow data processing in the
nodes that host the database.
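To make the put, get, and scan semantics concrete, here is a toy in-memory table. It is a stand-in written for illustration, not the HBase shell, the C# API, or the REST API: the class name `ToyTable`, its method signatures, and the sample rows are all assumptions. It keeps two behaviors of the real system: column families are declared up front while individual columns are not, and a scan walks rows in sorted key order from an inclusive start row to an exclusive stop row.

```python
class ToyTable:
    """In-memory stand-in for one HBase table (illustrative only)."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; the column family must exist, the qualifier need not."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Read all cells of a single row (empty dict if the row is absent)."""
        return dict(self.rows.get(row_key, {}))

    def scan(self, start_row="", stop_row=None):
        """Yield rows in key order, start inclusive and stop exclusive."""
        for key in sorted(self.rows):
            if key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self.rows[key])


contacts = ToyTable(["personal", "office"])
contacts.put("1000", "personal:name", "Alice Example")
contacts.put("1000", "office:phone", "555-0100")
contacts.put("2000", "personal:name", "Bob Example")

print(contacts.get("1000"))
print(list(contacts.scan(start_row="1500")))
```

The first print shows both cells of row 1000; the second shows only row 2000, since the scan starts at key 1500.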
Scenarios: What are the use cases for HBase?
The canonical use case for which BigTable (and by extension, HBase) was created was web search. Search engines
build indexes that map terms to the web pages that contain them. But there are many other use cases that HBase is
suitable for—several of which are itemized in this section.
Key-value store: HBase can be used as a key-value store, and it is suitable for managing message systems.
Facebook uses HBase for their messaging system, and it is ideal for storing and managing Internet
communications. WebTable uses HBase to search for and manage tables that are extracted from webpages.
Sensor data: HBase is useful for capturing data that is collected incrementally from various sources. This
includes social analytics, time series, keeping interactive dashboards up-to-date with trends and counters,
and managing audit log systems. Examples include Bloomberg trader terminal and the Open Time Series
Database (OpenTSDB), which stores and provides access to metrics collected about the health of server
systems.
Real-time query: Phoenix is a SQL query engine for Apache HBase. It is accessed as a JDBC driver, and it
enables querying and managing HBase tables by using SQL.
HBase as a platform: Applications can run on top of HBase by using it as a datastore. Examples include
Phoenix, OpenTSDB, Kiji, and Titan. Applications can also integrate with HBase. Examples include Hive, Pig,
Solr, Storm, Flume, Impala, Spark, Ganglia, and Drill.
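The web search use case mentioned at the start of this section, an index that maps terms to the pages containing them, is usually called an inverted index. The sketch below builds one in memory; the page URLs and texts are made up, and nothing here is HBase-specific. In an HBase-style layout, one plausible mapping (an assumption, not something the source states) would use the term as the row key and one column per page.

```python
from collections import defaultdict

# Hypothetical pages: URL -> text content.
pages = {
    "example.com/a": "hbase stores sparse data",
    "example.com/b": "hive queries data in hadoop",
}


def build_index(pages):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


index = build_index(pages)

print(sorted(index["data"]))   # ['example.com/a', 'example.com/b']
print(sorted(index["hbase"]))  # ['example.com/a']
```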
Next steps
Get started using HBase with Hadoop in HDInsight
Provision HDInsight clusters on Azure Virtual Network
Configure HBase replication in HDInsight
Analyze Twitter sentiment with HBase in HDInsight
Use Maven to build Java applications that use HBase with HDInsight (Hadoop)
15. Get started with Apache HBase in HDInsight
Learn how to create HBase tables and query HBase tables by using Hive in HDInsight.
HBase is a low-latency NoSQL database that allows online transactional processing of big data. HBase is offered as a
managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in
Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices. This enables
customers to build interactive websites that work with large datasets, to build services that store sensor and telemetry
data from millions of end points, and to analyze this data with Hadoop jobs. For more information about HBase and the
scenarios it can be used for, see HDInsight HBase overview.
NOTE: HBase (version 0.98.0) is only available for use with HDInsight 3.1 clusters (based on Apache Hadoop
and YARN 2.4.0). For version information, see What’s new in the Hadoop cluster versions provided by HDInsight?
Prerequisites
Before you begin this tutorial, you must have the following:
An Azure subscription: For more information about obtaining a subscription, see Purchase Options, Member
Offers, or Free Trial.
An Azure storage account: For instructions, see How To Create a Storage Account.
A workstation with Visual Studio 2013 installed: For instructions, see Installing Visual Studio.
Provision an HBase cluster
NOTE:
The steps in this article create an HDInsight cluster by using basic configuration settings. For
information about other cluster configuration settings (such as using an Azure virtual network
or a metastore for Hive and Oozie), see Provision Hadoop clusters in HDInsight by using custom
options.
To provision an HBase cluster by using the Azure portal
1. Sign in to the Azure portal.
2. Click NEW in the lower left, and then click DATA SERVICES > HDINSIGHT > HBASE.
You can also use the CUSTOM CREATE option. (The above describes the older classic portal; the below describes
the new portal, which uses the Resource Manager construct.)
3. Enter CLUSTER NAME, CLUSTER SIZE, CLUSTER USER PASSWORD, and STORAGE ACCOUNT.
The default HTTP USER NAME is admin. You can customize the name by using the CUSTOM CREATE option.
WARNING:
For high availability of HBase services, you must provision a cluster that contains at least three nodes. This ensures that,
if one node goes down, the HBase data regions are available on other nodes.
4. Click the checkmark icon in the lower right to create the HBase cluster.
NOTE:
After an HBase cluster is deleted, you can create another HBase cluster by using the same default blob. The new cluster
will pick up the HBase tables you created in the original cluster.