NoSQL databases are now used in several application scenarios where relational databases were used before. Several types of databases exist. In this presentation we compare key-value, column-oriented, document-oriented and graph databases, and use a simple case study to evaluate the pros and cons of the NoSQL databases considered.
This presentation is about NoSQL, which means "Not Only SQL". It covers the aspects of using NoSQL for Big Data and the differences from RDBMS.
Rahul Singh of Anant Corporation covers the three common problems in Datastax / Cassandra operations which stem from Data Modeling and outlines strategies and best practices to deal with them.
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big… (Big Data Spain)
Hadoop clusters can store nearly everything in your data lake, cheaply and blazingly fast. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses.
https://www.bigdataspain.org/2017/talk/fishing-graphs-in-a-hadoop-data-lake
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
A Study Review of Common Big Data Architecture for Small-Medium Enterprise (Ridwan Fadjar)
This slide was created to present the result of my paper about "A Study Review of Common Big Data Architecture for Small-Medium Enterprise" at MSCEIS FPMIPA Universitas Pendidikan Indonesia 2019.
In cooperation with: https://www.linkedin.com/in/faijinali and https://www.linkedin.com/in/fajriabdillah
Peter Marshall, Technology Evangelist at Imply
Abstract: Apache Druid® can revolutionise business decision-making with a view of the freshest of fresh data in web, mobile, desktop, and data science notebooks. In this talk, we look at key activities to integrate into Apache Druid POCs, discussing common hurdles and signposting to important information.
Bio: Peter Marshall (https://petermarshall.io) is an Apache Druid Technology Evangelist at Imply (http://imply.io/), a company founded by original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
View the Big Data Technology Stack in a nutshell. This Big Data Technology Stack deck covers the different layers of the Big Data world and summarizes the major technologies in vogue today.
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2…) (Nicolas Kourtellis)
A general overview of the APACHE SAMOA platform for mining big data streams using machine learning algorithms running on distributed stream processing platforms such as Apache STORM, Apache Flink, Apache Samza and Apache Apex.
Results are shown from experimentation with VHT, the Vertical Hoeffding Tree proposed in "VHT: Vertical Hoeffding Tree." N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Mordupo. IEEE BigData 2016.
Presentation in APACHE BIG DATA North America 2016
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Businesses can leverage Big Data to enhance market share and ROI with valuable insights, accelerated time-to-market and new revenue streams. Learn how TCS transforms business by leveraging Big Data for accurate business insights
QAT Global is a global information technology (IT) services company providing Agile-based software development, IT consulting, technology and distributed development services. We pride ourselves in being a leader in the delivery of enterprise business solutions through the innovative use of technologies such as Enterprise Java and .NET as well as Open Source components.
QAT Global focuses on delivering business results by helping clients find ways to capitalize on change, leverage emerging technologies effectively, and out innovate competitors through collaborative engagements. The company leverages an enhanced global delivery model, innovative enterprise development framework for distributed environments, repeatable process methodology based in Agile and Scrum, multimedia communication tools, and deep industry expertise to provide high-value IT services. This approach enables its clients to improve their end user’s experience, expand market reach, improve time to market, and reduce operating costs and risks.
QAT Global serves government agencies, companies ranging from early stage startups to Global 2000 companies, and leading software vendors in Banking & Financial Services, Transportation, Insurance, Manufacturing, Utilities, Telecommunications, Information & Entertainment industries, Human Resource Management, Benefits Administration, Government, E-Commerce, and Communications & Technology.
QAT Global has extensive experience and in-depth expertise in application modernization, Business Process Management, rich internet applications, and distributed software development. The company’s service offerings include technology consulting, custom software application development and maintenance, software product engineering, systems integration, application modernization, web and mobile application development, big data and analytics, and testing services.
Founded in 1995, and headquartered in Omaha, Nebraska, QAT Global has operations in the United States and Brazil.
Verticals - Banking & Financial Services, Transportation, Insurance, Manufacturing, Utilities, Telecommunications, Software Publishing, Information & Entertainment industries, Human Resource Management, Benefits Administration, Government, E-Commerce & Ebusiness, and Communications & Technology
Clients - QAT Global serves companies ranging from early stage startups to Global 2000 companies and leading software vendors.
Offices - QAT Global is headquartered in Omaha, Nebraska. The QAT Global offshore development center is located in Uberaba, MG, Brazil.
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends (Esther Kundin)
An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem. Detailed explanation of how HDFS, MapReduce, and HBase work, followed by a discussion of how to tune HBase performance. Finally, a look at industry trends, including challenges faced and being solved by Bloomberg for using Hadoop for financial data.
BigData HUB is a non-profit organization that helps to spread Big Data and Data Science technology around Egyptian universities and globally.
https://www.facebook.com/BigDataHub
Hadoop Basics - Apache Hadoop Big Data training by Design Pathshala (Design Pathshala)
Learn Hadoop and Bigdata Analytics, Join Design Pathshala training programs on Big data and analytics.
This slide covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services (Marlabs)
Marlabs’ Business Intelligence and Analytics practice can support customers’ needs throughout the information management lifecycle. As a vendor-agnostic and holistic service provider with expertise in a range of tools and technologies, we can help clients make informed decisions to employ the right technologies that align with their business needs.
Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB… (Usama Fayyad)
BigData in financial services and banking: a view from online advanced analytics, with case studies from Yahoo! and others. This is a shortened presentation; a longer version is available. Includes commentary on Hadoop and the MapReduce grid, and where it is appropriate to use them.
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed…) (Uwe Printz)
Talk held at the Java User Group on 05.09.2013 in Novi Sad, Serbia
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Big Data raises challenges about how to process such a vast pool of raw data and how to make it add value to our lives. To address these demands, an ecosystem of tools named Hadoop was conceived.
A presentation on big data,
from the workshop "The era of big data: why and how?" at the 22nd conference of the Computer Society of Iran (csicc2017.ir)
Vahid Amiri
vahidamiry.ir
datastack.ir
The need to process huge data is increasing day by day. Processing huge data involves compute, network and storage. In terms of Big Data, what does it take to innovate, and what is innovation in the end? This talk provides high-level details on the need for big data and the capabilities of the MapR Converged Data Platform.
Speaker: Vijaya Saradhi Uppaluri, Technical Director at MapR Technologies
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo
Europe/Ljubljana
Data and scripts are available at: https://www.events.prace-ri.eu/event/1226/timetable/
2. Agenda
1. Introduction
• The history, the #BigData, a bit of theory behind…
2. What is Hadoop, part 1
• Introducing HDFS and Map/Reduce
3. What is Hadoop, part 2
• The next generation (v. 2.x), real time, …
4. Microsoft and Big Data
• Lambda architecture and Windows Azure, WA Storage(s), WA HDInsight
5. Q&A
3. Who am I?
(Who bothers? )
Stefano Paluello
• Tech Lead @ SG Gaming
• All around geek, passionate about architecture, Cloud and Data
• Co-founder of various start-up(s)
7. History
• 2002: Hadoop, created by Doug Cutting (part of the Lucene project), starts as an open-source search engine for the Web. It has its origins in Apache Nutch, part of the Lucene project (a full-text search engine).
• 2003: Google publishes a paper describing its own distributed file system, called GFS.
• 2004: The first version of NDFS, the Nutch Distributed File System, implements Google's paper.
8. History
• 2004: Google publishes another paper, introducing the MapReduce algorithm
• 2005: The first version of MapReduce is implemented in Nutch
• 2005 (end): Nutch's MapReduce is running on NDFS
• 2006 (Feb): Nutch's MapReduce and NDFS become the core of a new Lucene subproject: Hadoop
9. History
• 2008: Yahoo launches the world's largest Hadoop PRODUCTION site
Some Webmap size data:
• # of links between pages in the index: roughly 1 trillion (10^12) links
• Size of the output: over 300 TB, compressed (!!!)
• # of cores to run a single MapReduce job: over 10,000
• Raw disk used in the production cluster: over 5 petabytes
14. What is #BigData?
BigData is a definition, but for some it is a buzzword (a keyword with no precise meaning that merely sounds interesting) trying to address the "new" (really?!?) need to process a lot of data.
To identify it, we usually use the "three Vs" to define BigData.
15. The 3 V's of #BigData
• Volume: the size of the data that we're dealing with
• Variety: the data is coming from a lot of different sources
• Velocity: the speed at which the data is generated
19. #BigData
It is predicted that between 2009 and 2020 the estimated size of the "digital universe" will grow to around 35 Zettabytes (1 ZB = 2^70 bytes), growing every year (!!!)
1 Zettabyte = 1k Exabytes or 1M Petabytes or 1G Terabytes
Source: www.wipro.com, July 2012
The #BigData market analysis and the 3Vs definition were introduced by a Gartner research about 13 years ago:
http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiminggartners-volume-velocity-variety-construct-for-big-data/
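The unit ladder on this slide can be checked directly; a quick sketch (binary prefixes assumed, since the slide defines a Zettabyte as 2^70 bytes):

```python
# Binary-prefix unit ladder: 1 ZB = 1k EB = 1M PB = 1G TB.
ZB = 2 ** 70  # zettabyte in bytes
EB = 2 ** 60  # exabyte
PB = 2 ** 50  # petabyte
TB = 2 ** 40  # terabyte

assert ZB == 1024 * EB       # 1k exabytes
assert ZB == 1024 ** 2 * PB  # 1M petabytes
assert ZB == 1024 ** 3 * TB  # 1G terabytes
```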
23. Lambda Architecture
Solves the problem of computing arbitrary functions on arbitrary data by decomposing the problem into three layers:
• The batch layer
• The serving layer
• The speed layer
24. The Batch layer
Stores all the data in an immutable, constantly growing dataset.
Accessing all the data is too expensive (even if possible), so precomputed "query" functions (aka "batch views", high-latency operations) are created, allowing the results to be accessed quickly.
26. The Serving layer
Indexes the batch views.
Loads the batch views and makes it possible to access and query them efficiently.
Usually it is a distributed database that loads in the batch views and is updated by the batch layer.
It requires batch updates and random reads, but does NOT require random writes.
27. The Speed layer
Compensates for the high-latency updates of the serving layer.
Provides fast incremental algorithms.
Updates the realtime view as new data arrives, without recomputing it from scratch like the batch layer does.
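The three layers can be sketched in miniature. A hypothetical Python toy (all names and data here are invented for illustration): the batch layer recomputes a view from the master dataset, the speed layer keeps an incremental realtime view, and a query merges both:

```python
from collections import defaultdict

# Immutable, growing master dataset (input to the batch layer).
master_dataset = [("pageA", 1), ("pageB", 1), ("pageA", 1)]

def batch_view(dataset):
    """Batch layer: recompute the whole view from scratch (high latency)."""
    view = defaultdict(int)
    for page, count in dataset:
        view[page] += count
    return dict(view)

# Speed layer: incremental realtime view for data the batch hasn't seen yet.
realtime_view = defaultdict(int)

def speed_update(page, count):
    realtime_view[page] += count  # fast incremental update

def query(page, batch):
    """Serving side: merge the batch view with the realtime view."""
    return batch.get(page, 0) + realtime_view[page]

batch = batch_view(master_dataset)
speed_update("pageA", 1)       # new event, not yet in the batch view
print(query("pageA", batch))   # 3 = 2 (batch) + 1 (speed)
```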
32. ACID
ACID is a set of properties that guarantee that database transactions are processed reliably. [Source: Wikipedia]
• Atomicity: "all or nothing". All the modifications in a transaction must happen successfully, or no changes are committed.
• Consistency: all my data will always be in a valid state after every transaction.
• Isolation: transactions are isolated, so any transaction is separated from and won't affect the data of other transactions.
• Durability: once a transaction is committed, the related data are safely and durably stored, regardless of errors, crashes or any software malfunctions.
33. CAP
The CAP theorem (or Brewer's theorem) is a set of basic requirements that describes a distributed system:
• Consistency: all the servers in the system will have the same data.
• Availability: all the servers in the system will be available and will return all the data available (even if it may not be consistent across the system).
• Partition (tolerance): the system continues to operate as a whole despite arbitrary message loss or failure of a part of the system.
According to the theorem, a distributed system CANNOT satisfy all three requirements at the SAME time (the "two out of three" concept).
36. Hadoop…
Where does the name come from?
The "legend" says that the name comes from the toy elephant of the son of Doug Cutting (one of the founders of the project). Hence the logo: a yellow smiling elephant.
37. Hadoop cluster
A Hadoop cluster consists of mainly two modules:
• A way to store distributed data: HDFS, the Hadoop Distributed File System (storage layer)
• A way to process data: MapReduce (compute layer)
This is the core of Hadoop!
38. HDFS
The Hadoop Distributed File System.
• From a developer's point of view it looks like a standard file system
• Runs on top of the OS file system (ext3, …)
• Designed to store very large amounts of data (petabytes and beyond) and to solve some of the problems that come with DFS and NFS
• Provides fast and scalable access to the data
• Stores data reliably
41. HDFS under the hood
All the files loaded into Hadoop are split into chunks, called blocks. Each block has a fixed size of 64 MB (!!!). Yes, megabytes!
Example: MyData (150 MB) is stored in HDFS as Blk_01 (64 MB) + Blk_02 (64 MB) + Blk_03 (22 MB).
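The 150 MB example splits exactly as the slide shows; a small sketch of the block arithmetic (64 MB block size, as stated above):

```python
BLOCK_SIZE = 64  # MB, the classic HDFS default block size

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS blocks a file would occupy."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))  # last block may be partial
        remaining -= BLOCK_SIZE
    return blocks

print(split_into_blocks(150))  # [64, 64, 22] -- Blk_01, Blk_02, Blk_03
```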
42. Datanode(s) and Namenode
The Datanode is a daemon (a service, in Windows language) running on each cluster node that is responsible for storing the blocks.
The Namenode is a dedicated node where the metadata of all the files (blocks) in the system is stored. It's the directory manager of HDFS.
To access a file, a client contacts the Namenode to retrieve the list of locations for its blocks. With the locations, the client contacts the Datanodes to read the data (possibly in parallel).
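The read path just described can be mimicked with a toy metadata map (all names, nodes and layouts below are invented for illustration; this is not the real HDFS protocol):

```python
# Namenode: metadata only -- which blocks a file has, and where replicas live.
namenode = {
    "MyData": {
        "blk_01": ["node1", "node4", "node7"],
        "blk_02": ["node2", "node5", "node8"],
    }
}

# Datanodes: actually hold the block bytes.
datanodes = {
    "node1": {"blk_01": b"first half "},
    "node2": {"blk_02": b"second half"},
}

def read_file(name):
    """Client: ask the Namenode for locations, then read from Datanodes."""
    data = b""
    for block, locations in namenode[name].items():
        for node in locations:  # try the replicas in order
            if block in datanodes.get(node, {}):
                data += datanodes[node][block]
                break
    return data

print(read_file("MyData"))  # b'first half second half'
```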
43. Data Redundancy
Hadoop replicates each block THREE times as it is stored in HDFS.
The location of every block is managed by the Namenode.
If a block is under-replicated (due to some failure on a node), the Namenode is smart enough to create another replica, until each block has three replicas inside the cluster.
Yes… you did your homework! If I have 100 TB of data to store in Hadoop, I will need 300 TB of storage space.
45. Namenode availability
If the Namenode fails, the WHOLE cluster becomes inaccessible.
In the early versions the Namenode was a single point of failure. A couple of solutions are now available:
• the Namenode stores its data on the network through NFS
• most production sites have two Namenodes: Active and Standby
46. HDFS Quick Reference
The HDFS commands are pretty easy to use and to remember (especially if you come from a *nix-like environment).
The commands usually have the "hadoop fs" prefix.
To list the content of an HDFS folder:
> hadoop fs -ls
To load a file into HDFS:
> hadoop fs -put <file>
To read a file loaded into HDFS:
> hadoop fs -tail <file>
And so on…
> hadoop fs -mkdir <dir>
> hadoop fs -mv <sourcefile> <destfile>
> hadoop fs -rm <file>
49. MapReduce
Processing large files serially can be a problem. MapReduce is designed to be a highly parallel way of managing data:
• Data are split into many pieces
• Each piece is processed simultaneously and in isolation, by tasks called Mappers
• The results of the Mappers are then brought together (with a process called "Shuffle and Sort") into a second set of tasks, the Reducers
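The Mapper → Shuffle and Sort → Reducer pipeline can be simulated in a few lines of Python (a word-count sketch; the phase names follow the slide, everything else is illustrative):

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data", "big hadoop", "data"]

# Map: each input piece is processed in isolation, emitting (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and Sort: bring all values belonging to the same key together.
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reduce: combine the grouped values into the final result per key.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'big': 2, 'data': 2, 'hadoop': 1}
```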
55. Using Hadoop Streaming
Hadoop Streaming allows you to write Mappers and Reducers in almost any language, rather than forcing you to use Java.
The command to run the streaming job is a bit "tricky".
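A minimal streaming pair might look like the sketch below: the mapper and reducer exchange tab-separated key/value lines, and between them the framework sorts by key. The jar name and I/O paths in the comment are placeholders, not taken from the slides:

```python
from itertools import groupby

def mapper(lines):
    """mapper.py: emit 'word<TAB>1' for every word of the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """reducer.py: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

# In a real job each half runs as its own script over stdin/stdout, e.g.:
#   hadoop jar hadoop-streaming.jar \
#       -input /logs -output /counts \
#       -mapper mapper.py -reducer reducer.py
# (jar name and paths above are placeholders)
sample = ["big data big", "data hadoop"]
print(list(reducer(sorted(mapper(sample)))))
```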
57. MapReduce on a "real" case
A retailer with many stores around the country. The data are written to a sequential log with date, store location, item, price and payment method:
2014-01-01  London     Clothes  13.99£  Card
2014-01-01  NewCastle  Music    05.69£  Bank
…
A really simple mapper will split each record into its fields and emit them; the sales total for every location is then calculated from the mapper's output.
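For the retailer log above, a sketch of the per-location totalling (the field layout follows the slide: date, location, item, price, payment; the sample rows and tab separator are assumptions):

```python
from collections import defaultdict

log_lines = [
    "2014-01-01\tLondon\tClothes\t13.99\tCard",
    "2014-01-01\tNewCastle\tMusic\t5.69\tBank",
    "2014-01-02\tLondon\tMusic\t7.50\tCard",
]

def map_sale(line):
    """Mapper: split the record and emit (location, price)."""
    _date, location, _item, price, _payment = line.split("\t")
    return location, float(price)

# Shuffle + Reduce: sum the prices per location.
totals = defaultdict(float)
for line in log_lines:
    location, price = map_sale(line)
    totals[location] += price

print({k: round(v, 2) for k, v in totals.items()})  # {'London': 21.49, 'NewCastle': 5.69}
```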
61. Hadoop related projects
• Pig: a high-level language for analyzing large data sets; it works as a compiler that produces M/R jobs
• Hive: data warehouse software facilitating querying and managing large data sets with a SQL-like language
• HBase: a scalable, distributed database that supports structured data storage for large tables
• Cassandra: a scalable multi-master database
63. Hadoop v 2.x
Hadoop is a pretty easy system to use, but a bit tricky to set up and manage.
The skills required are more related to system management than to development.
Add to that the fact that the Apache documentation has never stood out for clarity and completeness.
So, to add a bit of mess, they decided to make v2, which actually changes a lot.
64. Hadoop v 2.x
The new Hadoop has FOUR modules (instead of two):
• Hadoop Common: common utilities supporting all the other modules
• HDFS: an evolution of the previous distributed file system
• Hadoop YARN: a framework for job scheduling and cluster resource management
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
65. Hadoop v 2.x
Hadoop v2, leveraging YARN, aims to become the new OS for data processing.
66. Hadoop and real time
Hadoop v2, using YARN, together with Storm (a free and open-source distributed real-time computation system), can compute your data in real time.
Some Hadoop distributions (like Hortonworks) are working on an effortless integration:
http://hortonworks.com/blog/stream-processing-inhadoop-yarn-storm-and-the-hortonworks-data-platform/
68. Microsoft Lambda Architecture support
Batch layer:
• WA HDInsight
• WA Blob storage
• MapReduce, Hive, Pig, …
Speed layer:
• Federation in WA SQL DB
• Azure Tables
• Memcached/MongoDB
• SQL Azure
• Reactive Extensions (Rx)
Serving layer:
• Azure Storage Explorer
• MS Excel (and Office suite)
• Reporting Services
• LINQ to Hive
• Analysis Services
69. Yahoo, Hadoop and SQL Server
Hadoop data flows from Apache Hadoop into a staging database, then into SQL Server Analysis Services (an SSAS cube) via the SQL Server Connector (Hadoop Hive ODBC), and is consumed by Microsoft Excel and PowerPivot, other BI tools and custom applications. Hadoop data can also feed a third-party database plus custom applications.
70. MS .NET SDK for Hadoop
• .NET client libraries for Hadoop
• Write MapReduce in Visual Studio using C# or F#
• Debug against local data
71. WebClient Libraries in .NET
• WebHDFS client library: works with files in HDFS and Windows Azure Blob storage; a scalable REST API to move files in and out, delete them from HDFS, and perform file and directory functions
• WebHCat client library: manages the scheduling and execution of jobs in an HDInsight cluster
72. Reactive Extensions (Rx): Pulling vs. Pushing Data
Interactive vs Reactive:
• In interactive programming, the application pulls data from a sequence that represents the source (IEnumerator)
• In reactive programming, the application subscribes to a data stream (called an observable sequence in Rx), with updates handed to it from the source
73. Reactive Extensions (Rx): Pulling vs. Pushing Data
• Interactive (pull): the application asks "Got next?" and calls MoveNext on IEnumerable<T>/IEnumerator<T>; the source answers with the next item
• Reactive (push): the source announces "Have next!" and calls OnNext on the application's IObserver<T>, obtained by subscribing to IObservable<T>
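The pull/push distinction maps cleanly onto iterators versus callbacks. A language-neutral Python sketch (the Rx types on the slides are .NET; the class and method names below are invented stand-ins):

```python
# Interactive (pull): the consumer asks "got next?" and pulls each value.
def numbers():
    for n in [1, 2, 3]:
        yield n                  # the consumer drives, via next()

pulled = [n for n in numbers()]  # MoveNext/Current, IEnumerator-style

# Reactive (push): the source says "have next!" and pushes to subscribers.
class Observable:
    def __init__(self):
        self._observers = []

    def subscribe(self, on_next):   # IObservable.Subscribe analogue
        self._observers.append(on_next)

    def emit(self, value):          # the source hands updates to observers
        for on_next in self._observers:
            on_next(value)          # IObserver.OnNext analogue

pushed = []
source = Observable()
source.subscribe(pushed.append)
for n in [1, 2, 3]:
    source.emit(n)

print(pulled, pushed)  # [1, 2, 3] [1, 2, 3]
```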
Editor's Notes
Example of data to store: transactions (financial, government related); logs (records of activity, location); business data (product catalogs, prices, customers); user data (images, documents, video); sensor data (temperature, pollution); medical data (x-rays, brain activity records); social (email, Twitter etc.)
Lambda architecture: “community” driven architecture providing a way for different BigData components to work together
The batch layer runs in a while(true) loop and recomputes the batch view from scratch. It's quite simple to implement.
The speed layer maintains the same keys as the batch layer, so it is able to recognize and select the same data. The difference is that this layer modifies the data as it receives it.
RECAP… Usually the Batch Layer is implemented with HDFS (Hadoop Distributed File System); serving database: ElephantDB, HBase…; Speed Layer: Cassandra (a map with a sorted map as value), or Cassandra with Storm (stream access), or an in-memory DB.
Example: in the cloud, on an elastic first-level system, the service should be "stateless" or at least "soft-state" (cached) and must always respond to the query, even if the backend is down. So the system will be "A" (immediately responsive) and "P" (regardless of a failure in the backend, the system keeps responding to requests).
Using SQL Server 2008 R2, Yahoo! enhanced its Targeting, Analytics and Optimization (TAO) infrastructure. Key points: with Big Data technology, Yahoo experienced the following benefits: improved ad campaign effectiveness and increased advertiser spending; a cube producing 24 terabytes of data quarterly, making it the world's largest SQL Server Analysis Services cube; the ability to handle more than 3.5 billion daily ad impressions, with hourly refresh rates. References: Microsoft case study, "Yahoo! Improves Campaign Effectiveness, Boosts Ad Revenue with Big Data Solution": http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001707