How to measure everything - a million metrics per second with minimal develop... - Jos Boumans
Krux is an infrastructure provider for many of the websites you
use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For
every request on those properties, Krux will get one or more as
well. We grew from zero traffic to several billion requests per
day in the span of 2 years, and we did so exclusively in AWS.
To make the right decisions in such a volatile environment, we
knew that data is everything; without it, you can't possibly make
informed decisions. However, collecting it efficiently, at scale,
at minimal cost and without burdening developers is a tremendous
challenge.
Join me in this session to learn how we overcame this challenge
at Krux; I will share with you the details of how we set up our
global infrastructure, entirely managed by Puppet, to capture over
a million data points every second on virtually every part of the
system, including inside the web server, user apps and Puppet itself,
for under $2000/month using off the shelf Open Source software and
some code we've released as Open Source ourselves. In addition, I’ll
show you how you can take (a subset of) these metrics and send them
to advanced analytics and alerting tools like Circonus or Zabbix.
This content is applicable to anyone collecting, or wanting
to collect, vast amounts of metrics in a cloud or datacenter
setting and trying to make sense of them.
This document discusses how Etsy uses ephemeral Hadoop clusters in the cloud to process and analyze their large amounts of data. They move data from databases and logs to S3, then use Cascading to run jobs that perform joins, grouping, etc. on the data in Hadoop. They also leverage Hadoop streaming to run MATLAB scripts for more complex analysis, and build a system called Barnum to coordinate jobs and return results. This approach allows them to flexibly scale processing from zero to thousands of nodes as needed in a cost-effective and isolated manner.
Distributed tracing is a very useful practice for Node.js because it gives you good visibility into the way your async code executes and the lifecycle of your external calls as they travel between many services.
Winning with Big Data: Secrets of the Successful Data Scientist - Dataspora
A new class of professionals, called data scientists, has emerged to address the Big Data revolution. In this talk, I discuss nine skills for munging, modeling, and visualizing Big Data. Then I present a case study of using these skills: the analysis of billions of call records to predict customer churn at a North American telecom.
http://en.oreilly.com/datascience/public/schedule/detail/15316
"Spark: from interactivity to production and back", Yurii OstapchukFwdays
Going from experiment to deployed prototype as fast as possible in a dynamic startup environment is invaluable. Being able to respond quickly to changes is no less important.
From interactive ad-hoc analysis to production applications with Spark and back - this is a story of one spirited engineer trying to make his life a little easier and a little more efficient while wrangling the data, writing Scala code, deploying Spark applications. The problems faced, the lessons learned, the options found and some smart solutions and ideas - this is what we will go through.
CapeDwarf is an open-source alternative to Google App Engine that allows applications written for GAE to run on WildFly Application Server without modification. It uses Infinispan to emulate Google's DataStore and the appengine-java-sdk along with Javassist to provide compatibility. CapeDwarf can be run on Linux by starting the WildFly server with specific parameters and configuration.
Konstantin Makarychev (Software Engineer): Using Spark for Machine Learning - Provectus
"Apache Spark is an open-source engine for processing large volumes of data. Among other things, Spark includes everything needed for machine learning, and it really is simple, right up until you need to use the results in production. I will explain how machine learning on Spark works in general, how to take it all to production, and what interesting things you can build from it."
The document describes the Unified Medical Language System (UMLS) which brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems. It discusses the three main UMLS knowledge sources: the Metathesaurus, Semantic Network, and SPECIALIST Lexicon. The Metathesaurus contains over 100 vocabularies and provides concept unique identifiers to link synonymous terms. The Semantic Network categorizes terms into 133 semantic types and 54 relationships. The SPECIALIST Lexicon contains syntactic and morphological information to support natural language processing. An example journal article analyzing UMLS term occurrences in clinical notes is also mentioned.
This document provides an overview of the Apache Hadoop API for input formats. It discusses the responsibilities of input formats, common input formats like TextInputFormat and KeyValueTextInputFormat, and binary formats like SequenceFileInputFormat. It also covers the InputFormat and RecordReader classes, using mappers to process input splits, and considerations for keys and values.
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
The document provides an overview of the Hadoop Distributed File System (HDFS). It describes HDFS's master-slave architecture with a single NameNode master and multiple DataNode slaves. The NameNode manages filesystem metadata and data placement, while DataNodes store data blocks. The document outlines HDFS components like the SecondaryNameNode, DataNodes, and how files are written and read. It also discusses high availability solutions, operational tools, and the future of HDFS.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
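The word-count example that summary refers to is small enough to sketch in full. Below is a minimal, hedged version for Hadoop Streaming: two separate Python scripts (shown together here), with the cluster invocation left as a comment because the streaming jar path varies by distribution.

```python
# --- mapper.py: emit (word, 1) for every word read from stdin ---
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
```

```python
# --- reducer.py: sum per-word counts; streaming hands the reducer
# --- the mapper output sorted by key, so runs of a word are contiguous ---
import sys

current, total = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(n)
if current is not None:
    print(current + "\t" + str(total))

# Local smoke test of the same dataflow the cluster runs:
#   cat input.txt | python3 mapper.py | sort | python3 reducer.py
# On a cluster, pass both scripts to the Hadoop Streaming jar
# (jar path varies by distribution).
```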
Introduction to data processing using Hadoop and Pig - Ricardo Varela
In this talk we give an introduction to big data processing and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
This document introduces Pig, an open source platform for analyzing large datasets that sits on top of Hadoop. It provides an example of using Pig Latin to find the top 5 most visited websites by users aged 18-25 from user and website data. Key points covered include who uses Pig, how it works, performance advantages over MapReduce, and upcoming new features. The document encourages learning more about Pig through online documentation and tutorials.
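The top-5 query in that Pig example is a filter, a join, a group-and-count, and an ordered limit. As a rough illustration of the dataflow Pig Latin expresses declaratively (toy data, not the deck's), here is the same logic in plain Python:

```python
from collections import Counter

users = [("alice", 22), ("bob", 30), ("carol", 19)]            # (name, age)
pages = [("alice", "a.com"), ("carol", "a.com"),
         ("alice", "b.com"), ("bob", "c.com")]                  # (user, url)

young = {name for name, age in users if 18 <= age <= 25}        # FILTER users
hits = Counter(url for user, url in pages if user in young)     # JOIN + GROUP + COUNT
print(hits.most_common(5))   # ORDER BY count DESC, LIMIT 5 -> [('a.com', 2), ('b.com', 1)]
```

On real data Pig distributes each of these steps as MapReduce jobs; the Python version only shows the shape of the computation.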
Apache Hive provides SQL-like access to your stored data in Apache Hadoop. Apache HBase stores tabular data in Hadoop and supports update operations. The combination of these two capabilities is often desired; however, the current integration shows limitations such as performance issues. In this talk, Enis Soztutar will present an overview of Hive and HBase and discuss new updates/improvements from the community on the integration of these two projects. Various techniques used to reduce data exchange and improve efficiency will also be provided.
This document provides an overview of the Hadoop Online Training course offered by AcuteSoft. The course covers Hadoop architecture including HDFS, MapReduce and YARN. It also covers related tools like HBase, Hive, Pig and Sqoop. The course includes lectures, demonstrations and hands-on exercises on Hadoop installation, configuration, administration and development tasks. It also includes a proof of concept mini project on analyzing Facebook data using Hive. Contact information is provided for free demo and pricing.
Introduction to HDInsight, and its capabilities, including Azure Storage, Hive, MapReduce, Mahout and HBase. See also some of the tools mentioned at http://bigdata.red-gate.com/ and source code at https://github.com/simonellistonball/GettingYourBigDataOnMapReduce
CloudOps CloudStack Days, Austin April 2015 - CloudOps2005
Cloud-Init is a tool that initializes virtual machines on first boot. It retrieves metadata from CloudStack like SSH keys and VM details. User-data can be passed to Cloud-Init to run scripts or configure VMs like deploying RabbitMQ. There are some issues with CloudStack and Cloud-Init around password/key changes not being detected on reboot. Alternatives include custom init scripts.
The document contains screenshots and descriptions of the setup and configuration of a Hadoop cluster. It includes images showing the cluster with different numbers of live and dead nodes, replication settings across nodes, and outputs of commands like fsck and job execution information. The screenshots demonstrate how to view cluster health metrics, manage nodes, and run MapReduce jobs on the Hadoop cluster.
This document provides an overview of a machine learning workshop including tutorials on decision tree classification for flight delays, clustering news articles with k-means clustering, and collaborative filtering for movie recommendations using Spark. The tutorials demonstrate loading and preparing data, training models, evaluating performance, and making predictions or recommendations. They use Spark MLlib and are run in Apache Zeppelin notebooks.
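As a flavor of what such a notebook cell looks like, here is a minimal k-means run with Spark MLlib's DataFrame API; the tiny inline dataset and column names are stand-ins for the workshop's prepared features, not material from the document.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# Two obvious blobs standing in for real feature vectors.
df = spark.createDataFrame(
    [(0.0, 0.0), (0.1, 0.2), (9.0, 9.1), (9.2, 8.9)], ["x", "y"])

# Pack raw columns into the single vector column MLlib expects.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)   # train
print(model.clusterCenters())                # inspect the learned centers
model.transform(features).show()             # per-row cluster ids in "prediction"
```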
This document provides an overview of Big Data and Hadoop concepts, architectures, and hands-on demonstrations using Microsoft Azure HDInsight. It begins with definitions of Big Data and Hadoop, then demonstrates sample end-to-end architectures using Azure services. Hands-on labs explore creating storage, streaming jobs, and querying data using HDInsight. The document emphasizes that Hadoop is well-suited for large-scale data exploration and analytics on unknown datasets. It shows how running Hadoop on Azure provides elasticity, low costs, and easier management compared to on-premises Hadoop deployments.
Big Data in the Cloud - Montreal April 2015 - Cindy Gross
slides:
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so Powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Analytics, BI
Do: Data, Hadoop, Hive, Analytics, BI
How this tech solves your business problems
Big Data Integration Webinar: Getting Started With Hadoop Big Data - Pentaho
This document discusses getting started with big data analytics using Hadoop and Pentaho. It provides an overview of installing and configuring Hadoop and Pentaho on a single machine or cluster. Dell's Crowbar tool is presented as a way to quickly deploy Hadoop clusters on Dell hardware in about two hours. The document also covers best practices like leveraging different technologies, starting with small datasets, and not overloading networks. A demo is given and contact information provided.
HP Helion European Webinar Series, Webinar #3 - BeMyApp
The document discusses building cloud native applications using the Helion Development Platform. It provides information on connecting applications to services like databases, using buildpacks to deploy different programming languages, and Windows support in Cloud Foundry including .NET applications and SQL Server. The presentation includes code examples and polls questions to engage webinar participants.
Using Apache Brooklyn to manage your application stack. Brooklyn is a cloud-agnostic orchestrator that can deploy an application to any cloud (including the creation of infrastructure) without changing the blueprint.
Practical Hadoop Big Data Training Course by Certified Architect - Kamal A
Practical Hadoop Big Data Training Course by Certified Architect.
A real-time Hadoop project on how to implement Hadoop-based projects for insurance and financial domain clients. Also includes live project experience discussion, case study discussion, and how to position yourself as a Hadoop/big data consultant. Core Java and Linux basics are covered for those who need them.
This 40-hour course provides training to become a Hadoop developer. It covers Hadoop and big data fundamentals, Hadoop file systems, administering Hadoop clusters, importing and exporting data with Sqoop, processing data using Hive, Pig, and MapReduce, the YARN architecture, NoSQL programming with MongoDB, and reporting tools. The course includes hands-on exercises, datasets, installation support, interview preparation, and guidance from instructors with over 8 years of experience working with Hadoop.
Zeronights 2015 - Big problems with big data - Hadoop interfaces security - Jakub Kałużny
Did "cloud computing" and "big data" buzzwords bring new challenges for security testers?
Apart from the complexity of Hadoop installations and the number of interfaces, standard techniques can be applied to test for web application vulnerabilities, SSL security and encryption at rest. We tested popular Hadoop environments and found a few critical vulnerabilities, which certainly cast a shadow over big data security.
WhiteHedge is a New Jersey, US-based company that is a Docker-certified consulting and training partner. WhiteHedge has also partnered with Chef and contributes to Chef open source.
Docker containers and Chef are very popular tools. Learn how to use Chef and Docker together for effective DevOps.
Big problems with big data – Hadoop interfaces security - SecuRing
Did "cloud computing" and "big data" buzzwords bring new challenges for security testers?
Apart from the complexity of Hadoop installations and the number of interfaces, standard techniques can be applied to test for web application vulnerabilities, SSL security and encryption at rest. We tested popular Hadoop environments and found a few critical vulnerabilities, which certainly cast a shadow over big data security.
Vskills certification for Hadoop and MapReduce assesses the candidate's skills on the Hadoop and MapReduce platform for big data applications. The certification tests candidates on various areas of Hadoop and MapReduce, including knowledge of Hadoop and MapReduce, their configuration and administration, cluster installation and configuration, and the use of Pig, ZooKeeper and HBase.
http://www.vskills.in/certification/Certified-Hadoop-and-Mapreduce-Professional
This Hadoop Hive tutorial covers a complete introduction to Hive: Hive architecture, Hive commands, Hive fundamentals and HiveQL. In addition, fundamental concepts of BIG Data & Hadoop are covered extensively.
By the end, you'll have a solid grasp of Hadoop Hive basics.
PPT Agenda
✓ Introduction to BIG Data & Hadoop
✓ What is Hive?
✓ Hive Data Flows
✓ Hive Programming
----------
What is Apache Hive?
Apache Hive is a data warehousing infrastructure built on top of Hadoop and targeted at SQL programmers. Hive lets SQL programmers enter the Hadoop ecosystem directly, with no prerequisites in Java or other programming languages. HiveQL is similar to SQL; it is used to manage and query data and to drive Hadoop MapReduce operations.
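To show the SQL-like feel, here is a small HiveQL example issued from Python through the PyHive client; PyHive, the host/port, and the table are assumptions for the sketch (any Hive shell or JDBC/ODBC client would run the same statements).

```python
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000)  # assumed HiveServer2
cur = conn.cursor()

# Plain SQL-looking DDL; Hive records the table layout in its metastore.
cur.execute(r"""
    CREATE TABLE IF NOT EXISTS pageviews (
        url STRING, user_id STRING, ts BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
""")

# A familiar aggregate; Hive compiles it to MapReduce work behind the scenes.
cur.execute("SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url")
for url, hits in cur.fetchall():
    print(url, hits)
```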
----------
Hive has the following 5 Components:
1. Driver
2. Compiler
3. Shell
4. Metastore
5. Execution Engine
----------
Applications of Hive
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
This document provides an overview of Capital One's plans to introduce Hadoop and discusses several proof of concepts (POCs) that could be developed. It summarizes the history and practices of using Hadoop at other companies like LinkedIn, Netflix, and Yahoo. It then outlines possible POCs for Hadoop distributions, ETL/analytics frameworks, performance testing, and developing a scaling layer. The goal is to contribute open source code and help with Capital One's transition to using Hadoop in production.
K Young, CEO of Mortar, gave a presentation on using MongoDB and Hadoop/Pig together. He began with a brief introduction to Hadoop and Pig, explaining their uses for processing large datasets. He then demonstrated how to load data from MongoDB into Pig using a connector, and store data from Pig back into MongoDB. The rest of the presentation focused on use cases for combining MongoDB and Pig, such as being able to separately manage data storage and processing. Young also showed some Mortar utilities for working with MongoDB data in Pig.
Building Production Ready Search Pipelines with Spark and Milvus - Zilliz
Spark is a widely used ETL tool for processing, indexing and ingesting data into the serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to the Milvus vector database for search serving.
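A rough sketch of that pipeline shape follows: Spark reads and cleans the raw text, an embedding model turns it into vectors, and pymilvus writes them to a collection. The model name, collection setup, URI, and the collect-to-driver shortcut are all illustrative assumptions, not the talk's production design.

```python
from pyspark.sql import SparkSession
from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient

spark = SparkSession.builder.appName("vector-ingest").getOrCreate()

# 1) Spark: read and lightly clean the unstructured text.
docs = (spark.read.text("/data/docs/*.txt")      # one document per line, assumed
             .withColumnRenamed("value", "text")
             .dropna())
rows = docs.limit(1000).collect()                # tiny sample on the driver, sketch only

# 2) Embed: all-MiniLM-L6-v2 yields 384-dim vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([r.text for r in rows])

# 3) Milvus: create a collection and push the vectors for search serving.
client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="docs", dimension=384)
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": v.tolist(), "text": r.text}
          for i, (v, r) in enumerate(zip(vectors, rows))],
)
```

At scale you would embed inside the Spark executors (e.g. with a pandas UDF) rather than collecting to the driver; the sketch only shows the hand-off between the three pieces.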
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Programming Foundation Models with DSPy - Meetup Slides - Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Driving Business Innovation: Latest Generative AI Advancements & Success Story - Safe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Infrastructure Challenges in Scaling RAG with Custom AI models - Zilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Hadoop: The Definitive Guide by Tom White
SQL Server Sqoop http://bit.ly/rulsjX
JavaScript http://bit.ly/wdaTv6
Twitter https://twitter.com/#!/search/%23bigdata
Hive http://hive.apache.org
Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj
Hadoop On Azure Videos http://tinyurl.com/6munnx2
Klout http://tinyurl.com/6qu9php
Microsoft Big Data http://microsoft.com/bigdata
Denny Lee http://dennyglee.com/category/bigdata/
Carl Nolan http://tinyurl.com/6wbfxy9
Cindy Gross http://tinyurl.com/SmallBitesBigData
Hadoop is part of NoSQL (Not Only SQL) and it's a bit wild. You explore in/with Hadoop. You learn new things. You test hypotheses on unstructured jungle data. You eliminate noise. Then you take the best learnings and share them with the world via a relational or multidimensional database. Atomicity, consistency, isolation, durability (ACID) is used in relational databases to ensure immediate consistency. But what if eventual consistency is good enough? In stomps BASE: basically available, soft state, eventual consistency. Scale up or scale out? Pay up front or pay as you go? Which IT skills do you utilize?
Hive is a database that sits on top of Hadoop. HiveQL (HQL) generates (possibly multiple) MapReduce programs to execute the joins, filters, aggregates, etc. The language is very SQL-like, perhaps closer to MySQL but still very familiar.
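One way to see those generated programs is Hive's EXPLAIN statement: prefixing a query prints the plan as a chain of stages, each stage a generated MapReduce job. A query sketch (table and column names are assumptions, not from the deck):

```python
# A join feeding an aggregate typically compiles to more than one MapReduce job.
query = """
SELECT u.country, COUNT(*) AS hits
FROM pageviews p
JOIN users u ON p.user_id = u.user_id
GROUP BY u.country
ORDER BY hits DESC
"""

# Running "EXPLAIN " + query in the Hive shell prints the stage plan
# (Stage-1, Stage-2, ...) that the single HiveQL statement expands into.
print("EXPLAIN " + query)
```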
Get your data from anywhere. There’s a data explosion and we can now use more of it than ever before. The HadoopOnAzure.com portal provides an easy interface to pull in data from sources including secure FTP, Amazon S3, Azure blob store, Azure Data Market. Use Sqoop to move data between Hadoop and SQL Server, PDW, SQL Azure. The Hive ODBC driver lets you display Hive data in Excel or apps.
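As a flavor of the Sqoop side, a minimal import pulling a SQL Server table into HDFS might look like the sketch below; the host, database, table, and paths are placeholders, and the command is wrapped in subprocess only to keep the example in Python.

```python
import subprocess

# Import the Orders table into HDFS with 4 parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:sqlserver://dbhost:1433;databaseName=sales",
    "--username", "etl",
    "--password-file", "/user/etl/.sqoop-pw",  # keeps the password out of argv
    "--table", "Orders",
    "--target-dir", "/data/sales/orders",
    "-m", "4",
], check=True)
```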
Many equate big data to MapReduce and in particular Hadoop. However, other applications like streaming, machine learning, and PDW type systems can also be described as big data solutions. Big Data is unstructured, flows fast, has many formats, and/or has quickly changing formats. How big is “big” really depends on what is too big/complex for your environment (hardware, people, software, processes). It’s done by scaling out on commodity (low end enterprise level) hardware.
Big data solutions consist of matching the right set of tools to the right set of problems (architectures are compositional, not monolithic). You need to select appropriate combinations of storage, analytics, and consumers.
For demo steps see: http://blogs.msdn.com/b/cindygross/archive/2012/05/07/load-data-from-the-azure-datamarket-to-hadoop-on-azure-small-bites-of-big-data.aspx
Big data is often described as problems that have one or more of the 3 (or 4) Vs: volume, velocity, variety, variability. Think about big data when you describe a problem with terms like tame the chaos, reduce the complexity, explore, I don't know what I don't know, unknown unknowns, unstructured, changing quickly, too much for what my environment can handle now, unused data.
Volume = more data than the current environment can handle with vertical scaling; the need to make use of data that is currently too expensive to use
Velocity = a small decision window compared to the data change rate; ask how quickly you need to analyze and how quickly data arrives
Variety = many different formats that are expensive to integrate, probably from many data sources/feeds
Variability = many possible interpretations of the data
It's not the hammer for every problem, and it's not the answer to every large store of data. It does not replace relational or multidimensional databases; it's a solution to a different sort of problem. It's a new, specialized type of database for certain scenarios. It will feed other types of databases.
Microsoft takes what is already there, makes it run on Windows, and offers the option of full control or simplification.
Hadoop in the cloud simplifies management.
Hadoop on Windows lets you reuse existing skills.
JavaScript opens up more hiring options.
The Hive ODBC Driver / Excel add-in lets you combine data and move data.
Sqoop moves data: a Linux-based version to/from SQL is available now, a Windows-based one soon.
Demo 2 – Mashup
1) Hive Pane
a. Excel, blank worksheet, data
b. Use your HadoopOnAzure cluster
c. Object = Gender2007 or whatever table you pre-loaded in Hive (select * from gender2007 limit 200)
d. KEY POINT = pulled data from multiple files across many nodes and displayed via ODBC in a user-friendly format – not easy in the Hadoop world
2) PowerPivot
a. KEY POINTS = uses local memory, pulls data from multiple data sources (structured and unstructured), can be stored/scheduled in SharePoint, creates relationships to add value – MASHUP
b. Excel file DeviceAnalysisByRegion.xlsx (worksheet with region/country data, relationship defined between Gender2007 country and this country data), click on the PowerPivot tab and open a blank tab
c. Click on PowerPivot Window – show that each tab is data from a different source – hivesampletable (Hadoop/unstructured) and regions (could be anything/structured)
d. Click on diagram view – show relationships, rich value
e. Pivot table, PivotChart, new
f. Close the Hive query window
g. Values = count of platform, axis = platform, zoom to selection
h. Slicers Vertical = regions hierarchy
i. Region = North America, country = Canada == Windows Phone jokes
j. KEY: Load to SharePoint, schedule refreshes, use for Power View
Expand your audience of decision makers by making BI easier with self-service and visualization.
Our products interact and work together, plus one company for questions/issues.
Use existing hardware and employees.
Expand options for hiring/training/re-training with familiar tools.
Familiar tools = less ramp-up time.
Cloud = elasticity, easy scale up/down, pay for what you use.
Easier to move data to/from HDFS.
It’s about separating the signal from the noise so you have insight to make decisions to take action. Discover, explore, gain insight.
Familiar tools, new tools, ease of use
Take action! All the exploring doesn’t help if you don’t do something! Something might be starting another round of exploring, but eventually DO SOMETHING!