This document provides an agenda and overview for a conference session on Big Data and NoSQL for database and BI professionals held from April 10-12 in Chicago, IL. The session will include an overview of big data and NoSQL technologies, then deeper dives into Hadoop, NoSQL databases like HBase, and tools like Hive, Pig, and Sqoop. There will also be demos of technologies like HDInsight, Elastic MapReduce, Impala, and running MapReduce jobs.
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis (Andrew Brust)
Slides from my Keynote at Visual Studio Live Las Vegas 2011 (Day 2).
Closely compares Azure to AWS, and discusses Force.com, Google, Rackspace, VMWare and Red Hat.
Discussion includes capabilities, pricing, strategy.
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
An unprecedented amount of data is being created and is accessible. This presentation shows how to use the new NoSQL technologies to make sense of all this data.
This presentation provides an introduction to Azure DocumentDB. Topics include elastic scale, global distribution and guaranteed low latencies (with SLAs) - all in a managed document store that you can query using SQL and JavaScript. We also review common scenarios and advanced data science scenarios.
RDX Insights Presentation - Microsoft Business Intelligence (Christopher Foot)
May's RDX Insights Series Presentation focuses on Microsoft's BI products. We begin with an overview of Power BI, SSIS, SSAS and SSRS and how the products integrate with each other. The webinar continues with a detailed discussion on how to use Power BI to capture, model, transform, analyze and visualize key business metrics. We’ll finish with a Power BI demo highlighting some of its most beneficial and interesting features.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
MongoDB and Hadoop: Driving Business Insights (MongoDB)
MongoDB and Hadoop can work together to solve big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other with complex analyses and greater intelligence. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application to drive product recommendations.
Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop
Agenda
- What is NOSQL?
- Motivations for NOSQL?
- Brewer’s CAP Theorem
- Taxonomy of NOSQL databases
- Apache Cassandra
- Features
- Data Model
- Consistency
- Operations
- Cluster Membership
- What Does NOSQL Mean for RDBMS?
A practical introduction to Oracle NoSQL Database - OOW2014 (Anuj Sahni)
Not familiar with Oracle NoSQL Database yet? This great product introduction session discusses the primary functionality included with the product as well as integration with other Oracle products. It includes a live demo that illustrates installation and configuration as well as data modeling and sample NoSQL application development.
An Intro to NoSQL Databases -- NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice. (source : http://martinfowler.com/)
Presented at Cassandra London (April 7, 2014); The challenges of time-series storage and analytics in OpenNMS, with an introduction to Newts, a new Cassandra-based time-series data store.
Operational Analytics Using Spark and NoSQL Data Stores (DATAVERSITY)
NoSQL data stores have emerged for scalable capture and real-time analysis of data. Apache Spark and Hadoop provide additional scalable analytics processing. This session looks at these technologies and how they can be used to support operational analytics to improve operational effectiveness. It also looks at an example of how operational analytics can be implemented in NoSQL environments using the Basho Data Platform with Apache Spark:
•The emergence of NoSQL, Hadoop and Apache Spark
•NoSQL Use Cases
•The need for operational analytics
•Types of operational analysis
•Key requirements for operational analytics
•Operational analytics using the Basho Data Platform with Apache Spark.
Overview of the different data models, mainly: flat file, hierarchical, network, relational, and object-oriented. CAP theorem, and NoSQL's four major models: document-oriented, column-oriented, key-value store, and graph. Followed by an overview of some of the famous NoSQL products: Redis, Cassandra, MongoDB, and Neo4j.
Time series data is proliferating with literally every step we take: think of Fitbit bracelets that track your every move, or financial trading data, all of it timestamped.
Time series data requires high-performance reads and writes even with a huge number of data sources. Both speed and scale are integral to success, which makes for a unique challenge for your database.
A time series NoSQL data model requires flexibility to support unstructured and semi-structured data, as well as the ability to write range queries to analyze your time series data. So how can you tackle speed, scale and flexibility all at once?
Join Professional Services Architect Drew Kerrigan and Developer Advocate Matt Brender for a discussion of:
Examples of time series data sets, from IoT to Finance to jet engines
What makes time series queries different from other database queries
How to model your dataset to answer the right questions about your data
How to store, query and analyze a set of time series data points
Learn how a NoSQL database model and Riak TS can help you address the unique challenges of time series data.
AWS is an incredibly popular environment for running MongoDB deployments. Today you have many choices about instance type, storage, network config, security, how you configure MongoDB processes, and more. In addition, you now have options when it comes to tooling to help you manage and operate your deployment. In this session, we’ll take a look at several recommendations that can help you get the best performance out of AWS.
Introduction to Microsoft's Big Data Platform and Hadoop Primer (Denny Lee)
This is my 24 Hours of PASS (September 2012) presentation, an introduction to Microsoft's Big Data Platform and Hadoop primer, also known as Project Isotope and HDInsight.
"NoSQL on the move" by Glynn Bird
Mobile-first web and app development is a solved problem, but how can your websites and apps continue to work with little or no internet connectivity? Discover how offline-first development allows apps to present an "always on" experience to their users.
How can .NET contribute to Data Science? What is .NET Interactive? What do notebooks have to do with it? And Apache Spark? And the Python world? And Azure? In this session we put these ideas in order.
AnalyticsConf2016 - Advanced analytics on the Azure HDInsight platform (Łukasz Grala)
A session on Microsoft's Big Data Analytics solution: Hortonworks (Hadoop, HBase, Storm, Spark) together with the high-performance R Server. Advanced analytics using RevoScaleR.
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... (MongoDB)
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on the high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
3. Meet Andrew
CEO and Founder, Blue Badge Insights
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
• http://www.msbinyc.com
Co-moderator, NYC .NET Developers Group
• http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust
5. Lynn Langit (in absentia)
CEO and Founder, Lynn Langit consulting
Former Microsoft Evangelist (4 years)
Google Developer Expert
MongoDB Master
MCT 13 years – 7 certifications
Cloudera Certified Developer
MSDN Magazine articles
• SQL Azure
• Hadoop on Azure
• MongoDB on Azure
www.LynnLangit.com
@LynnLangit
7. Agenda
Overview / Landscape
• Big Data, and Hadoop
• NoSQL
• The Big Data-NoSQL Intersection
Drilldown on Big Data
Drilldown on NoSQL
8. What is Big Data?
100s of TB into PB and higher
Involving data from: financial data, sensors, web logs, social media, etc.
Parallel processing often involved
Hadoop is emblematic, but other technologies are Big Data too
Processing of data sets too large for transactional databases
Analyzing interactions, rather than transactions
The three V's: Volume, Velocity, Variety
Big Data tech sometimes imposed on small data problems
9. Big Data = Exponentially More Data
Retail Example -> 'Feedback Economy'
• Number of transactions
• Number of behaviors (collected every minute)
10. Big Data = 'Next State' Questions
• What could happen?
• Why didn't this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Collecting behavioral data
11. My Data: An Example from Health Care
Medical records
• Regular
• Emergency
• Genetic data – 23andMe
Food data
• SparkPeople
Purchasing
• Grocery card
• credit card
Search – Google
Social media
• Twitter
• Facebook
Exercise
• Nike Fuel Band
• Kinect
• Location - phone
13. Big Data Considerations
• Collection – get the data
• Storage – keep the data
• Querying – make sense of the data
• Visualization – see the business value
14. Data Collection
Types of Data
• Structured, semi-structured, unstructured vs. data standards
• Behavioral vs. transactional data
Methods of collection
• Sensors everywhere
• Machine-2-Machine
• Public Datasets
• Freebase
• Azure DataMarket
• Hilary Mason's list
15. What's MapReduce?
Partition the bulk input data and send to mappers (nodes in cluster)
Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer
Reducer aggregates; one output per key, with value
Map and Reduce code natively written as Java functions
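The map-shuffle-reduce flow described above can be sketched in a few lines of plain Python (a toy word-count simulation for illustration, not Hadoop itself):

```python
from collections import defaultdict

def mapper(line):
    # Pre-process each input line into (key, value) pairs
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped):
    # Route all values for a given key to the same reducer
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Aggregate: one output per key, with value
    return key, sum(values)

lines = ["big data and hadoop", "hadoop and nosql"]
mapped = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 1, 'data': 1, 'and': 2, 'hadoop': 2, 'nosql': 1}
```

In a real cluster the mapper calls run in parallel on many nodes and the shuffle moves data across the network; the logic, however, is exactly this shape.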
17. A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per-platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floors
• Tally up each platform
• Merge tallies into one spreadsheet
• Collect the tallies
18. What's a Distributed File System?
One where data gets distributed over commodity drives on commodity servers
Data is replicated
• If one box goes down, no data lost
• "Shared Nothing"
BUT: Immutable
• Files can only be written to once
• So updates require drop + re-write (slow)
• You can append though
• Like a DVD/CD-ROM
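The replication idea can be pictured with a toy in-memory model (plain Python, illustrative only; node names and replica count are made up, though 3 replicas is HDFS's default):

```python
import random

def place_blocks(blocks, nodes, replicas=3):
    # Each block is copied to `replicas` distinct nodes ("shared nothing")
    placement = {}
    for block in blocks:
        placement[block] = random.sample(nodes, replicas)
    return placement

def surviving_blocks(placement, failed_node):
    # A block survives as long as at least one replica is off the failed node
    return [b for b, locs in placement.items()
            if any(n != failed_node for n in locs)]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk1", "blk2", "blk3"], nodes)
# With 3 replicas spread over 4 nodes, losing any single node loses no data
assert surviving_blocks(placement, "node1") == ["blk1", "blk2", "blk3"]
```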
19. Hadoop = MapReduce + HDFS
Modeled after Google MapReduce + GFS
Have more data? Just add more nodes to the cluster.
• Mappers execute in parallel
• Hardware is commodity
• "Scaling out"
Use of HDFS means data may well be local to mapper processing
• So, not just parallel, but minimal data movement, which avoids network bottlenecks
20. Comparison: RDBMS vs. Hadoop
                      Traditional RDBMS         Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)     Petabytes (Exabytes)
Updates               Read / Write many times   Write once, Read many times
Integrity             High (ACID)               Low
Query Response Time   Can be near immediate     Has latency (due to batch processing)
21. Just-in-Time Schema
When looking at unstructured data, schema is imposed at query time
Schema is context specific
• If scanning a book, are the values words, lines, or pages?
• Are notes a single field, or is each word a value?
• Are date and time two fields or one?
• Are street, city, state, zip separate or one value?
• Pig and Hive let you determine this at query time
• So does the Map function in MapReduce code
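As a toy illustration of schema-at-query-time (plain Python with made-up field names, not Pig or Hive syntax), the same raw line can be given two different schemas at read time:

```python
raw = "2013-04-10 08:15:00,Chicago IL 60601"

# Schema A: date and time as one field, address as one value
record_a = {"timestamp": raw.split(",")[0], "address": raw.split(",")[1]}

# Schema B: date and time as two fields, address split into parts
date, time = raw.split(",")[0].split(" ")
city, state, zip_code = raw.split(",")[1].rsplit(" ", 2)
record_b = {"date": date, "time": time, "city": city,
            "state": state, "zip": zip_code}

print(record_a["timestamp"])  # 2013-04-10 08:15:00
print(record_b["zip"])        # 60601
```

The raw bytes never change; only the interpretation applied at query time does, which is exactly what Pig, Hive, and a hand-written Map function let you decide per query.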
22. What's HBase?
A Wide-Column Store NoSQL database
Modeled after Google BigTable
Uses HDFS
Therefore, Hadoop-compatible
Hadoop MapReduce often used with HBase
But you can use either without the other
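A wide-column store's data model can be pictured as a sorted map of maps: row key, then column family, then qualifier. A rough Python analogy (illustrative only, not the HBase API, and omitting HBase's per-cell timestamps):

```python
from collections import defaultdict

# rowkey -> column family -> qualifier -> value
table = defaultdict(lambda: defaultdict(dict))

table["row1"]["info"]["name"] = "widget"
table["row1"]["metrics"]["cpu"] = "0.75"
table["row2"]["info"]["name"] = "gadget"

# Rows are sparse: row2 simply has no "metrics" entries, and that costs nothing
print(table["row1"]["metrics"]["cpu"])  # 0.75
print(sorted(table.keys()))             # ['row1', 'row2']
```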
24. NoSQL Confusion
Many 'flavors' of NoSQL data stores
Easiest to group by functionality, but…
• Dividing lines are not clear or consistent
NoSQL choice(s) driven by many factors
• Type of data
• Quantity of data
• Knowledge of technical staff
• Product maturity
• Tooling
25. So much wrong information
• Everything is 'new'
• People are religious about data storage
• Lots of incorrect information
• 'Try' before you 'buy' (or use)
• Watch out for over-simplification
• Confusion over vendor offerings
26. Common NoSQL Misconceptions
Problems
• Everything is 'new'
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL
Solutions
• 'Try' before you 'buy' (or use)
• Leverage NoSQL communities
• Add NoSQL to existing RDBMS solution
28. The Hadoop Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration
29. What's Hive?
Began as Hadoop sub-project
Now top-level Apache project
Provides a SQL-like (“HiveQL”) abstraction over MapReduce
Has its own HDFS table file format (and it's fully schema-bound)
Can also work over HBase
Acts as a bridge to many BI products which expect tabular data
31. Microsoft HDInsight
Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)
• Single node preview runs on Windows client
Includes ODBC Driver for Hive
JavaScript MapReduce framework
Contributes it all back to the open source Apache project
32. Hortonworks Data Platform for Windows
Amenities for Visual Studio/.NET:
• MRLib (NuGet Package)
• MR code in C#: HadoopJob, MapperBase, ReducerBase
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
33. Some ways to work
Microsoft HDInsight
• Cloud: go to www.windowsazure.com, request a cluster
• Local: Download Microsoft HDInsight
• Runs on just about anything, including Windows XP
• Get it via the Web Platform installer (WebPI)
• Local version is free; cloud billed at 50% discount during preview
Amazon Web Services Elastic MapReduce
• Create AWS account
• Select Elastic MapReduce in Dashboard
• Cheap for experimenting, but not free
Cloudera CDH VM image
• Download as .tar.gz file
• “Un-tar” (can use WinRAR, 7zip)
• Run via VMWare Player or Virtual Box
• Everything’s free
35. Microsoft HDInsight
Much simpler than the others
Browser-based portal
• Launch MapReduce jobs
• Azure: Provisioning cluster, managing ports, gathering external data
Interactive JavaScript & Hive console
• JS: HDFS, Pig, light data visualization
• Hive commands and metadata discovery
• New console coming
Desktop Shortcuts:
• Command window, MapReduce, Name Node status in browser
• Azure: from the portal page you can RDP directly to the Hadoop head node for these desktop shortcuts
36. April 10-12 | Chicago, IL
Demo
Windows Azure HDInsight
37. Amazon Elastic MapReduce
Lots of steps!
At a high level:
• Setup AWS account and S3 “buckets”
• Generate Key Pair and PEM file
• Install Ruby and EMR Command Line Interface
• Provision the cluster using CLI
• A batch file can work very well here
• Setup and run SSH/PuTTY
• Work interactively at command line
38. Demo
Amazon Elastic MapReduce
39. Cloudera CDH4 Virtual Machine
Get it for free, in VMWare and Virtual Box versions
• VMWare Player and Virtual Box are free too
Run it, and configure it to have its own IP on your network. Use ifconfig to discover the IP.
Assuming an IP of 192.168.1.59, open a browser on your own (host) machine and navigate to:
• http://192.168.1.59:8888
Can also use the browser in the VM and hit:
• http://localhost:8888
Work in "Hue"…
40. Hue
Browser-based UI, with front ends for:
• HDFS (w/ upload & download)
• MapReduce job creation and monitoring
• Hive ("Beeswax")
And in-browser command line shells for:
• HBase
• Pig ("Grunt")
41. Impala: What it Is
Distributed SQL query engine over Hadoop cluster
Announced at Strata/Hadoop World in NYC on October 24th
In Beta, as part of CDH 4.1
Works with HDFS and Hive data
Compatible with HiveQL and Hive drivers
• Query with Beeswax
42. Impala: What it's Not
Impala is not Hive
• Hive converts HiveQL to Java MapReduce code and executes it in batch mode
• Impala executes the query interactively over the data
• Brings BI tools and Hadoop closer together
Impala is not an Apache Software Foundation project
• It is open source and Apache-licensed, but it's still incubated by Cloudera
• Only in CDH
43. Demo
Cloudera CDH4, Impala
44. Hadoop commands
HDFS
• hadoop fs -command
• Create and remove directories: mkdir, rm, rmr
• Upload and download files to/from HDFS: get, put
• View directory contents: ls, lsr
• Copy, move, view files: cp, mv, cat
MapReduce
• Run a Java jar-file based job: hadoop jar jarname params
49. Submitting, Running and Monitoring Jobs
Upload a JAR
Use Streaming
• Use other languages (i.e. other than Java) to write MapReduce code
• Python is a popular option
• Any executable works, even C# console apps
• On MS HDInsight, JavaScript works too
• Still uses a JAR file: streaming.jar
Run at command line (passing JAR name and params) or use GUI
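A minimal sketch of the streaming model in Python (a toy word count; real Hadoop Streaming would run the mapper and reducer as separate scripts reading stdin and writing tab-separated lines to stdout):

```python
from itertools import groupby

def map_lines(lines):
    # Mapper: emit one tab-separated "word\t1" pair per word
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reduce_pairs(pairs):
    # Hadoop sorts mapper output by key before the reducer runs;
    # groupby then aggregates each run of identical keys
    parsed = (p.split("\t") for p in sorted(pairs))
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    for out in reduce_pairs(map_lines(["hive and pig", "pig on hadoop"])):
        print(out)
```

Because everything passes through plain text streams, the same pattern works for any executable — which is why C# console apps and JavaScript fit the model too.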
50. Demo
Running MapReduce Jobs
51. Hive
Used by most BI products which connect to Hadoop
Provides a SQL-like abstraction over Hadoop
• Officially HiveQL, or HQL
Works on own tables, but also on HBase
Query generates a MapReduce job, the output of which becomes the result set
Microsoft has a Hive ODBC driver
• Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
52. Hive, Continued
Load data from flat HDFS files
• LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;
SQL Queries
• CREATE, ALTER, DROP
• INSERT OVERWRITE (creates whole tables)
• SELECT, JOIN, WHERE, GROUP BY
• SORT BY, but ordering data is tricky!
• MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code
53. Data Explorer
• Beta add-in for Excel
• Acquire, transform data
• Data sources include Facebook, HDFS
• Visually- or script-driven
• Also includes Azure BLOB storage backing up HDInsight
55. Pig
Instead of SQL, employs a language ("Pig Latin") that accommodates data flow expressions
• Do a combo of Query and ETL
"10 lines of Pig Latin ≈ 200 lines of Java."
Works with structured or unstructured data
Operations
• As with Hive, a MapReduce job is generated
• Unlike Hive, output is only a flat file to HDFS or text at the command line console
• With HDInsight, can easily convert to JavaScript array, then manipulate
Use command line ("Grunt") or build scripts
56. Example
A = LOAD 'myfile'
AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE
group, COUNT(B);
STORE D INTO 'output';
57. Pig Latin Examples
Imperative, file system commands
• LOAD, STORE
• Schema specified on LOAD
Declarative, query commands (SQL-like)
• xxx = file or data set
• FOREACH xxx GENERATE (SELECT…FROM xxx)
• JOIN (WHERE/INNER JOIN)
• FILTER xxx BY (WHERE)
• ORDER xxx BY (ORDER BY)
• GROUP xxx BY / GENERATE COUNT(xxx)
(SELECT COUNT(*) GROUP BY)
• DISTINCT (SELECT DISTINCT)
Syntax is assignment statement-based:
• MyCusts = FILTER Custs BY SalesPerson eq 15;
Access HBase
• CpuMetrics = LOAD 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');
61. Flume NG
Source
• Avro (data serialization system – can read json-encoded data files, and can
work over RPC)
• Exec (reads from stdout of long-running process)
Sinks
• HDFS, HBase, Avro
Channels
• Memory, JDBC, file
62. Flume NG (next generation)
Setup conf/flume.conf
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
63. Mahout Algorithms
Recommendation
• Your info + community info
• Give users/items/ratings; get user-user/item-item
• itemsimilarity
Classification/Categorization
• Drop into buckets
• Naïve Bayes, Complementary Naïve Bayes, Decision Forests
Clustering
• Like classification, but with categories unknown
• K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
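As a toy illustration of the clustering idea (not Mahout itself, which runs these algorithms at scale over Hadoop), here is a minimal 1-D k-means sketch; the function name and data are made up:

```python
# Minimal 1-D k-means sketch -- illustrative only. Mahout runs this kind
# of algorithm distributed over Hadoop; names and data here are made up.

def kmeans_1d(points, centers, iterations=10):
    """Assign each point to its nearest center, then recompute centers
    as cluster means, and repeat."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]
    print(kmeans_1d(data, centers=[0.0, 12.0]))  # converges to [2.0, 10.0]
```

Note the categories are not known in advance: the algorithm discovers the two groups itself, which is the difference from classification noted above.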
65. The Truth About Mahout
Mahout is really just an algorithm engine
Its output is almost unusable by non-statisticians/non-data scientists
You need a staff or a product to visualize, or make into a usable
prediction model
Investigate Predixion Software
• CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
• Excel add-in can use Mahout remotely, visualize its output, run predictive
analyses
• Also integrates with SQL Server, Greenplum, MapReduce
• http://www.predixionsoftware.com
66. The “Data-Refinery” Idea
Use Hadoop to “on-board” unstructured data, then extract manageable
subsets
Load the subsets into conventional DW/BI servers and use familiar
analytics tools to examine them
This is the current rationalization of Hadoop + BI tools' coexistence
Will it stay this way?
67. Google BigQuery
Dremel-based service for massive amounts of data
Pay for query and storage
SQL-like query language
Has an Excel connector
71. “Web Scale”
This is the term used to justify NoSQL
Scenario is simple needs, but “made up for in volume”
• Millions of concurrent users
Think of sites like Amazon or Google
Think of non-transactional tasks like loading catalog data to display a
product page, or environment preferences
75. Consistency
CAP Theorem
• Databases may only excel at two of the following three attributes:
consistency, availability and partition tolerance
NoSQL does not offer “ACID” guarantees
• Atomicity, consistency, isolation and durability
Instead offers “eventual consistency”
Similar to DNS propagation
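The eventual-consistency idea can be sketched in a few lines of Python. This is a toy model with made-up names, not any real database's API: writes land on one replica and propagate later, so a read from another replica can briefly return stale data.

```python
# Toy sketch of eventual consistency (made-up names, not a real DB API):
# writes land on one replica and propagate asynchronously, so reads from
# another replica can briefly be stale -- much like DNS propagation.

class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet propagated

    def write(self, key, value, replica=0):
        self.replicas[replica].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def propagate(self):
        """Anti-entropy step: push pending writes to every replica."""
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()

if __name__ == "__main__":
    store = EventuallyConsistentStore()
    store.write("stock:item42", 0, replica=0)      # Seattle sees depletion
    print(store.read("stock:item42", replica=1))   # NY still stale: None
    store.propagate()
    print(store.read("stock:item42", replica=1))   # now consistent: 0
```

The window between `write` and `propagate` is exactly the window the inventory example on the next slide worries about.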
76. Consistency
Things like inventory and account balances should be consistent
• Imagine updating a server in Seattle that stock was depleted
• Imagine not updating the server in NY
• A customer in NY goes to order 50 pieces of the item
• The order is processed even though there is no stock
Things like catalog information don't have to be, at least not immediately
• If a new item is entered into the catalog, it's OK for some customers to see it
even before the other customers' server knows about it
But catalog info must come up quickly
• Therefore, don't lock data in one location while waiting to update the other
Therefore, it's OK to sacrifice consistency for speed, in some cases
78. Indexing
Most NoSQL databases are indexed by key
Some allow so-called “secondary” indexes
Often the primary key indexes are clustered
HBase uses HDFS (the Hadoop Distributed File System), which is
append-only
• Writes are logged
• Logged writes are batched
• File is re-created and sorted
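The three bullets above describe a log-structured pattern. Here is an illustrative Python sketch of that pattern (not HBase's actual implementation; all names are invented): puts only append to a log, and a compaction step re-creates the sorted file.

```python
# Illustrative sketch of the append-only pattern described above (not
# HBase's real implementation): writes are logged, logged writes are
# batched, and the data file is re-created in sorted order on compaction.

class AppendOnlyStore:
    def __init__(self):
        self.log = []          # write-ahead log: appended (key, value) pairs
        self.sorted_file = []  # immutable, sorted (key, value) pairs

    def put(self, key, value):
        self.log.append((key, value))  # never update in place

    def compact(self):
        """Merge logged writes into a new sorted file; last write wins."""
        merged = dict(self.sorted_file)
        merged.update(dict(self.log))  # later log entries overwrite earlier
        self.sorted_file = sorted(merged.items())
        self.log = []

    def get(self, key):
        # Newest log entries shadow both older entries and the sorted file.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return dict(self.sorted_file).get(key)
```

The append-only discipline is what makes the pattern a fit for HDFS, which does not support in-place updates.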
79. Queries
Typically no query language
Instead, create procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used…
80. MapReduce
This is not Hadoop's MapReduce, but it's conceptually related
Map step: pre-processes data
Reduce step: summarizes/aggregates data
Will show a MapReduce code sample for Mongo soon
Will demo map code on CouchDB
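MongoDB expresses map/reduce in JavaScript; as a preview of the shape of that code, here is a Python analogue with made-up document data: map emits (key, value) pairs per document, reduce folds the values for each key.

```python
# Python analogue of document-store map/reduce (MongoDB uses JavaScript;
# the documents and function names here are invented for illustration).

def map_order(doc, emit):
    """Map step: pre-process one document, emitting (key, value) pairs."""
    emit(doc["customer"], doc["total"])

def reduce_totals(key, values):
    """Reduce step: aggregate all values emitted for one key."""
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    emitted = {}
    def emit(key, value):
        emitted.setdefault(key, []).append(value)
    for doc in docs:
        map_fn(doc, emit)
    return {k: reduce_fn(k, v) for k, v in emitted.items()}

orders = [
    {"customer": 101, "total": 300},
    {"customer": 202, "total": 2500},
    {"customer": 101, "total": 150},
]

if __name__ == "__main__":
    print(map_reduce(orders, map_order, reduce_totals))  # {101: 450, 202: 2500}
```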
82. Sharding
A partitioning pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication also provided
• Good for disaster recovery
Since “shards” can be geographically distributed, sharding can act like a
CDN
Good for keeping data close to processing
• Reduces network traffic when MapReduce splitting takes place
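A toy sketch of hash-based sharding with a fan-out query, using dicts as stand-ins for separate servers (all names invented; real systems typically use consistent hashing so shards can be added without rehashing everything):

```python
# Toy sketch of hash-based sharding with a fan-out query. Each shard is
# a dict standing in for a separate server; names are made up.

class ShardedStore:
    def __init__(self, shard_count=3):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Simple deterministic placement; real systems use consistent
        # hashing so adding a shard doesn't rehash every key.
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        # Single-key reads touch exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out_query(self, predicate):
        """Ask every shard, then merge the partial results."""
        results = []
        for shard in self.shards:
            results.extend(v for v in shard.values() if predicate(v))
        return results
```

Note the asymmetry: `get` touches one shard, while `fan_out_query` must visit all of them, which is why key-based access is the cheap path in sharded stores.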
84. Key-Value Stores
The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
• Schema may differ from row to row
Common on cloud platforms
• e.g. Amazon SimpleDB, Azure Table Storage
MemcacheDB, Voldemort, Couchbase, DynamoDB
(AWS), Dynomite, Redis and Riak
85. Key-Value Stores
Database
  Table: Customers
    Row ID: 101
      First_Name: Andrew
      Last_Name: Brust
      Address: 123 Main Street
      Last_Order: 1501
    Row ID: 202
      First_Name: Jane
      Last_Name: Doe
      Address: 321 Elm Street
      Last_Order: 1502
  Table: Orders
    Row ID: 1501
      Price: 300 USD
      Item1: 52134
      Item2: 24457
    Row ID: 1502
      Price: 2500 GBP
      Item1: 98456
      Item2: 59428
86. Wide Column Stores
Has tables with declared column families
• Each column family has “columns” which are KV pairs that can vary from row to row
These are the most foundational for large sites
• BigTable (Google)
• HBase (Originally part of Yahoo-dominated Hadoop project)
• Cassandra (Facebook)
• Calls column families “super columns” and tables “super column families”
They are the most “Big Data”-ready
• Especially HBase + Hadoop
87. Wide Column Stores
Table: Customers
  Row ID: 101
    Super Column: Name
      Column: First_Name: Andrew
      Column: Last_Name: Brust
    Super Column: Address
      Column: Number: 123
      Column: Street: Main Street
    Super Column: Orders
      Column: Last_Order: 1501
  Row ID: 202
    Super Column: Name
      Column: First_Name: Jane
      Column: Last_Name: Doe
    Super Column: Address
      Column: Number: 321
      Column: Street: Elm Street
    Super Column: Orders
      Column: Last_Order: 1502
Table: Orders
  Row ID: 1501
    Super Column: Pricing
      Column: Price: 300 USD
    Super Column: Items
      Column: Item1: 52134
      Column: Item2: 24457
  Row ID: 1502
    Super Column: Pricing
      Column: Price: 2500 GBP
    Super Column: Items
      Column: Item1: 98456
      Column: Item2: 59428
89. Document Stores
Have “databases,” which are akin to tables
Have “documents,” akin to rows
• Documents are typically JSON objects
• Each document has properties and values
• Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained
JSON objects - Allows for hierarchical storage)
• Can have attachments as well
Old versions are retained
• So Doc Stores work well for content management
Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:
• CouchDB
• Derivatives
• MongoDB
90. Document Store Application Orientation
Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON
• Documents are JSON objects
• CouchDB/MongoDB use JavaScript as native language
In CouchDB, “view functions” also have unique URIs and they return
HTML
• So you can build entire applications in the database
91. Document Stores
Database: Customers
  Document ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address:
      Number: 123
      Street: Main Street
    Orders:
      Most_recent: 1501
  Document ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address:
      Number: 321
      Street: Elm Street
    Orders:
      Most_recent: 1502
Database: Orders
  Document ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Document ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
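The customer document above maps directly to a JSON object with nested sub-documents, which is how document stores persist it. A small sketch (the field names mirror the slide; the round trip is illustrative):

```python
# The customer above, modeled as a document-store document: nested
# sub-documents (Address, Orders) live inside one JSON object, giving
# the hierarchical storage described on the previous slides.
import json

customer_101 = {
    "_id": 101,
    "First_Name": "Andrew",
    "Last_Name": "Brust",
    "Address": {"Number": 123, "Street": "Main Street"},   # sub-document
    "Orders": {"Most_recent": 1501},                       # sub-document
}

if __name__ == "__main__":
    doc = json.dumps(customer_101)      # what the store would persist
    loaded = json.loads(doc)
    print(loaded["Address"]["Street"])  # Main Street
```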
94. Graph Databases
Great for social network applications and others where relationships are
important
Nodes and edges
• Edge like a join
• Nodes like rows in a table
Nodes can also have properties and values
Neo4j is a popular graph db
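The nodes-and-edges model can be sketched in a few lines of Python. This is an illustrative property-graph toy, not the Neo4j API; the node IDs and edge types are invented.

```python
# Minimal property-graph sketch (illustrative, not the Neo4j API):
# nodes carry properties; typed edges act like pre-materialized joins.

class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> properties dict
        self.edges = []   # (from_id, edge_type, to_id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, from_id, edge_type, to_id):
        self.edges.append((from_id, edge_type, to_id))

    def neighbors(self, node_id, edge_type):
        """Follow edges of one type out of a node -- no join needed."""
        return [t for f, e, t in self.edges
                if f == node_id and e == edge_type]

g = Graph()
g.add_node("andrew", name="Andrew Brust")
g.add_node("jane", name="Jane Doe")
g.add_node("joe", name="Joe Smith")
g.add_edge("andrew", "friend_of", "jane")
g.add_edge("andrew", "friend_of", "joe")

if __name__ == "__main__":
    print(g.neighbors("andrew", "friend_of"))  # ['jane', 'joe']
```

Traversing `friend_of` edges here is the graph equivalent of the relational join the slide alludes to, which is why these databases shine for social-network queries.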
95. Graph Databases (diagram)
Nodes: Andrew Brust, Jane Doe, Joe Smith, George Washington; an Address (Street: 123 Main Street, City: New York, State: NY, Zip: 10014); an Order (ID: 252, Total Price: 300 USD); Item 1 (ID: 52134, Type: Dress, Color: Blue); Item 2 (ID: 24457, Type: Shirt, Color: Red)
Edges: Friend of, Sent invitation to, Commented on photo by, Placed order, Address
96. NoSQL on Windows Azure
Platform as a Service
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/
MongoDB, DIY:
• On an Azure Worker Role:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
98. NoSQL + BI
NoSQL databases are bad for ad hoc query and data warehousing
BI applications involve models; models rely on schema
Extract, transform and load (ETL) may be your friend
Wide-column stores, however, are good for “Big Data”
• See next slide
Wide-column stores and column-oriented databases are similar
technologically
99. NoSQL + Big Data
Big Data and NoSQL are interrelated
Typically, Wide-Column stores used in Big Data scenarios
Prime example:
• HBase and Hadoop
Why?
• Lack of indexing not a problem
• Consistency not an issue
• Fast reads very important
• Distributed file systems important too
• Commodity hardware and disk assumptions also important
• Not Web scale but massive scale-out, so similar concerns
101. Common DBA Tasks in NoSQL
RDBMS                         | NoSQL
------------------------------|---------------------------------
Import Data                   | Import Data
Setup Security                | Setup Security
Perform a Backup              | Make a copy of the data
Restore a Database            | Move a copy to a location
Create an Index               | Create an Index
Join Tables Together          | Run MapReduce
Schedule a Job                | Schedule a (Cron) Job
Run Database Maintenance      | Monitor space and resources used
Send an Email from SQL Server | Set up resource threshold alerts
Search BOL                    | Interpret Documentation
102. Which Type of NoSQL for Which Type of Data?
Type of Data             | Type of NoSQL Solution | Example
-------------------------|------------------------|-----------
Log files                | Wide Column            | HBase
Product catalogs         | Key-Value on disk      | DynamoDB
User profiles            | Key-Value in memory    | Redis
Startups                 | Document               | MongoDB
Social media connections | Graph                  | Neo4j
LOB w/Transactions       | NONE! Use RDBMS        | SQL Server
103. Relational vs. NoSQL
Line of Business -> Relational
Large, public (consumer)-facing sites -> NoSQL
Complex data structures -> Relational
Big Data -> NoSQL
Transactional -> Relational
Content Management -> NoSQL
Enterprise -> Relational
Consumer Web -> NoSQL
105. NoSQL To-Do List
Understand CAP & the types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL in the cloud
• Quick and cheap for behavioral data
• Mash up cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon
Karmasphere, Microsoft Excel connectors, etc.
106. NoSQL for .NET Developers
RavenDB
MongoDB C#/.NET Driver
MongoDB on Windows Azure
CouchBase .NET Client Library
Riak client for .NET
AWS Toolkit for Visual Studio
Google cloud APIs (REST-based)
http://nosql-database.org/
http://hadoop.apache.org/ & http://www.mongodb.org/
Wikipedia - http://en.wikipedia.org/wiki/NoSQL
List of NoSQL databases - http://nosql-database.org/
The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
When the volume of data is too much for simple human interpretation -> Man PLUS Machine (Data Mining / Statistics)
About Data Science - http://www.romymisra.com/the-new-job-market-rulers-data-scientists/
R language - http://www.r-project.org/
Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
There are a plethora of languages to access, manipulate and process Big Data. These languages fall into a few categories:
RESTful - simple, standards-based
ETL - Pig (Hadoop) is an example
Query - Hive (again Hadoop), lots of *QL
Analyze - R, Mahout, Infer.NET, DMX, etc., applying statistical (data-mining) algorithms to the data output