Engineering your cloud infrastructure using CHEF. This presentation was given as part of my application to the University of Ottawa for a tenure-track professor role in the Faculty of Engineering. The focus was on using CHEF for infrastructure as code, with a small tangent discussing a MapReduce example. The presentation is partly in English and partly in French.
Content and talk by Giovani Lanzani (GoDataDriven) at SEA Amsterdam in November 2014: real-time data-driven applications using Python and pandas as a backend.
Flink Forward Berlin 2017: Max Kiessling, Martin Junghanns - Cypher-based Gra... (Flink Forward)
Graph pattern matching is one of the most interesting and challenging operations on graph data. However, it is primarily supported by graph database systems such as Neo4j but not generally available for distributed processing frameworks like Apache Flink or Apache Spark. In our talk, we want to give an overview of our current implementation of Cypher on Apache Flink. Cypher is the Neo4j graph query language and enables the intuitive definition of graph patterns including structural and semantic predicates. As the Neo4j graph data model is not supported out-of-the box by Apache Flink, we leverage Gradoop, a Flink-based graph analytics framework that already provides an abstraction of schema-free property graphs. We will give a brief overview about the technologies used to implement Cypher, explain our query engine and give a demonstration of the available language features. In addition, we will present benchmark results from running Cypher queries on billion edge graphs.
This presentation is designed to provide a basic overview of the following: System Organization, Memory Organization, Stack Organization (For Function Calls), A Vulnerable C Program, Exploiting Buffer Overflow
The Terror-Free Guide to Introducing Functional Scala at Work (Jorge Vásquez)
Too often, our applications are dominated by boilerplate that's not fun to write or test, and that makes our business logic complicated. In object-oriented programming, classes and interfaces help us with abstraction to reduce boilerplate. But, in functional programming, we use type classes.
Historically, type classes in functional programming have been very complex and confusing, partially because they import ideas from Haskell that don't make sense in Scala, and partially because of their esoteric origins in category theory.
In this presentation, Jorge Vásquez presents a new library called ZIO Prelude, which offers a distinctly Scala take on Functional Abstractions, and you will learn how you can eliminate common types of boilerplate by using it.
Come see how you can improve your happiness and productivity with a new take on what it means to do functional programming in Scala!
This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.
This is part of an introductory course on Big Data Tools for Artificial Intelligence. These slides introduce students to Spark, the new in-memory cluster computing framework.
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith... (Data Con LA)
"At OpenX we not only use the tools in the big data ecosystem to solve our business problems, but also explore cutting-edge algorithms for practical use. HyperLogLog is one of the algorithms we use intensively in our internal systems. It has a really low computation cost and can easily be plugged into a map-reduce framework (Hadoop or Spark). Some applications worth highlighting are:
* high-cardinality tests
* distinct counts of unique users over time
* visualizing HyperLogLog for fraud detection"
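To give a feel for why the algorithm is so cheap, here is a minimal HyperLogLog estimator in Python. This is an illustrative sketch, not code from the talk; the choice of `hashlib.md5` as the hash function and 1024 registers is an assumption.

```python
import hashlib

def hll_estimate(items, b=10):
    """Estimate the number of distinct items using m = 2**b registers."""
    m = 2 ** b
    registers = [0] * m
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        j = h & (m - 1)   # first b bits of the hash choose a register
        w = h >> b        # remaining bits carry the randomness
        rho = 1           # rank: number of trailing zero bits in w, plus one
        while w & 1 == 0:
            w >>= 1
            rho += 1
        registers[j] = max(registers[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

With b=10 the sketch is only 1024 small integers, yet estimates of, say, 10,000 distinct values typically land within a few percent of the truth; and because each register only keeps a maximum, sketches from different map-reduce workers can be merged by taking register-wise maxima.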
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has meant better careers, salaries, and job opportunities for many professionals.
Social Data Analytics using IBM Big Data Technologies (Nicolas Morales)
Distilling Insights from Social Media Using Big Data Technologies
Have you ever wondered what your customers are saying about you in Social media, and the impact it might be having on your business? This session will focus on how BigInsights and Big Data technologies can be used to glean useful and actionable insights from social media data.
You'll see how data can be ingested and prepped and do text analytics on social data in real time. Using Hadoop, we'll show you how you can store and analyze your large volume of historical social media data and reference data. This talk and demo will provide an introduction to text analytics and how it is used within the IBM Big Data platform for a social media solution.
Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL.
View the HD video of this talk here: http://vimeo.com/chug/oozie-overview
Exploring how to manage data processing pipelines on Hadoop without using cron. This talk explores the details, and pitfalls, of Apache Oozie and Falcon.
This deck presents the best practices of using Apache Hive with good performance. It covers getting data into Hive, using ORC file format, getting good layout into partitions and files based on query patterns, execution using Tez and YARN queues, memory configuration, and debugging common query performance issues. It also describes Hive Bucketing and reading Hive Explain query plans.
You can now download the presentation directly from Slideshare.
*Disclaimer: this is just my imaginary example of a comms plan for the Puma work, not the actual strategy created by Droga5 for Puma. I had nothing to do with that plan and am just a fan of their work.
What is Comms Planning? is a presentation that clearly explains the role of the Comms Planner within an advertising agency. I use the example of the Puma Social campaign to prove the point.
Storage and computation is getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets Eg: user activity logs, genome sequencing, sensory data etc. Hadoop and the ecosystem of projects built around it present simple and easy to use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce Jobs (using Hadoop Streaming).
* Introduce PigLatin, an easy-to-use data processing language.
Speaker Profile: Mahesh Reddy is an entrepreneur chasing dreams. He works on large-scale crawling and extraction of structured data from the web. He is a graduate of IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as a Research Engineer/Tech Lead on search and advertising products.
GALE: Geometric active learning for Search-Based Software Engineering (CS, NC State)
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
Cloud Computing course presentation, Tarbiat Modares University
By: Sina Ebrahimi, Mohammadreza Noei
Advisor: Sadegh Dorri Nogoorani, PhD.
Presentation Date: 1397/03/07
Video Link in Aparat: https://www.aparat.com/v/N5VbK
Video Link on TMU Cloud: http://cloud.modares.ac.ir/public.php?service=files&t=9ecb8d2dd08df6f990a3eb63f42011f7
This presentation's pptx file (some animations may be lost in SlideShare): http://cloud.modares.ac.ir/public.php?service=files&t=f62282dbd205abaa66de2512d9fdfc83
OLAP Basics and Fundamentals by Bharat Kalia
OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
What is Distributed Computing, Why we use Apache Spark (Andy Petrella)
In this talk we introduce the notion of distributed computing then we tackle the Spark advantages.
The Spark core content is very tiny because the whole explanation has been done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).
This talk has been given together by @xtordoir and myself at the University of Liège, Belgium.
This document introduces the theory behind geological database processes and provides detailed examples using the geological database modelling functions in Surpac. By working through this tutorial you will gain skills in the creation, use, and modification of geological databases.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference, 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into integrating generative AI with OpenAI's advanced natural language processing capabilities in a test automation solution.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and on application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
2. 2
Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions
3. 3
Why?
• Lots of data
• How to deal with it?
• Hadoop to the rescue!
• When to use it?
• When not to use it?
• Curiosity
4. 4
MapReduce: Origins
• Functional programming
• Higher-order functions that operate on lists
• map
• apply a function to each element of the list
• reduce = fold = accumulate
• aggregate a list and produce one value of output
• No side effects
5. 5
MapReduce: Origins
• (define (+1 e) (+ e 1))
• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)
• (reduce + 0 (list 2 3 4)) ⇒ 9
• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
6. 6
MapReduce: Origins
• These functions do not have side effects
• And can be parallelized easily
• Can split the input data into chunks:
• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) them together
7. 7
MapReduce: Origins
• Mapping separately:
• (define res1 (reduce + 0 (map +1 (list 1 2))))
• (reduce + res1 (map +1 (list 3 4)))
• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))
• Note that for reduce the function must be additive
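The chunk-then-combine identity from these slides can be checked directly in Python, with `functools.reduce` playing the role of reduce; the helper name `plus1` is illustrative, not from the deck:

```python
from functools import reduce
from operator import add

def plus1(e):
    # (define (+1 e) (+ e 1))
    return e + 1

# Reduce over the whole mapped list in one go ...
total = reduce(add, map(plus1, [1, 2, 3, 4]), 0)

# ... or reduce each chunk separately and combine the partial results.
res1 = reduce(add, map(plus1, [1, 2]), 0)
res2 = reduce(add, map(plus1, [3, 4]), res1)

# Both give 14, because + is associative: that is what lets
# the chunks be processed in parallel and then combined.
assert total == res2 == 14
```

This associativity requirement is exactly the "additive" condition the slide places on the reduce function.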
8. 8
MapReduce
• A map function
• takes a key-value pair (in_key, in_val)
• produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
• for each group in the intermediate results
• aggregates and produces the final output
9. 9
MapReduce Stages
Each MapReduce job is executed in 3 stages:
• map stage: apply map to each key-value pair
• group together the intermediate results by key
• reduce stage: apply reduce to each group
12. 12
MapReduce Example
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
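The word-count pseudocode above translates almost line for line into runnable Python. In this sketch a plain dict stands in for the framework's group/shuffle step, and the names `map_fn`, `reduce_fn`, and `run_job` are illustrative rather than part of any Hadoop API:

```python
from collections import defaultdict

def map_fn(input_key, doc):
    # Emit an intermediate (w, 1) pair for each word in the document.
    for w in doc.split():
        yield (w, 1)

def reduce_fn(output_key, output_vals):
    # Sum the ones collected for a single word.
    res = 0
    for v in output_vals:
        res += v
    return res

def run_job(docs):
    # Group stage: collect intermediate pairs by key.
    groups = defaultdict(list)
    for input_key, doc in docs.items():
        for w, one in map_fn(input_key, doc):
            groups[w].append(one)
    # Reduce stage: one reduce call per group.
    return {w: reduce_fn(w, vals) for w, vals in groups.items()}

counts = run_job({"doc1": "to be or not to be"})
assert counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In a real cluster the grouping is done by the framework's shuffle, and the map and reduce calls run on different worker nodes.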
13. 13
MapReduce Example
• map stage: output (w, 1) for each word w
• group the pairs into (w, [1, 1, ..., 1]), a list of ones for each word w
• reduce stage: for each w, calculate how many ones there are
16. 16
“Hadoop ... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
17. 17
Hadoop
• Open Source implementation of MapReduce
• "Hadoop":
• HDFS
• Hadoop MapReduce
• HBase
• Hive
• ... many others
18. 18
Hadoop Cluster: Terminology
• Name Node: orchestrates the process
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase
28. 28
Advantages
• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware
29. 29
Disadvantages
• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young
34. 34
Cheetah
• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary
35. 35
Hive
• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying
36. 36
HiveQL
Tables
STATUS_UPDATES(userid int, status string, ds string)
PROFILES(userid int, school string, gender int)

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
37. 37
HiveQL
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
SELECT subq1.gender, count(1)
GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
SELECT subq1.school, count(1)
GROUP BY subq1.school
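For readers who think in dataframes, the multi-table insert above is one join followed by two independent aggregations over the same intermediate result. A minimal pure-Python sketch (table and column names taken from the query; the rows are made up):

```python
from collections import Counter

# toy rows mirroring the two tables (made-up data)
status_updates = [  # (userid, status, ds)
    (1, "hi", "2009-03-20"),
    (2, "yo", "2009-03-20"),
    (3, "hey", "2009-03-20"),
]
profiles = {1: ("MIT", 0), 2: ("MIT", 1), 3: ("CMU", 1)}  # userid -> (school, gender)

# subq1: join on userid, filtered to one day's partition
subq1 = [profiles[uid]
         for uid, status, ds in status_updates
         if ds == "2009-03-20" and uid in profiles]

# both INSERTs read subq1 once, like the multi-table insert
gender_summary = Counter(gender for school, gender in subq1)
school_summary = Counter(school for school, gender in subq1)
```

The point of the HiveQL form is that the joined intermediate data is scanned a single time to feed both summary tables.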
39. 39
HiveQL
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) as cnt
  FROM
    (MAP b.school, a.status
     USING 'meme_extractor.py'
     AS (school, meme)
     FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc) subq2
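The MAP and REDUCE clauses above stream rows through external scripts as tab-separated lines on stdin/stdout. The real meme_extractor.py is not shown in the talk; this hypothetical sketch only illustrates the streaming protocol (the one-row-per-word "meme" extraction is invented for the example):

```python
import sys

def extract_memes(line):
    # Turn one tab-separated (school, status) input row into
    # tab-separated (school, meme) output rows. The "meme" extraction
    # here is a stand-in: emit one row per word in the status.
    school, status = line.rstrip("\n").split("\t")
    return [f"{school}\t{meme}" for meme in status.split()]

if __name__ == "__main__":
    # Hive feeds input rows on stdin and reads emitted rows from stdout
    for line in sys.stdin:
        for row in extract_memes(line):
            print(row)
```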
41. 41
Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data
42. 42
ETL
• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean data and transform to some structured format
• with MapReduce
• Load: extract from HDFS, load to DW
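The Transform step can be pictured as a mapper that turns each raw line into a structured record ready for loading into the DW. A toy sketch, where the pipe-delimited input format and the field names are invented for illustration:

```python
import json

def transform(raw_line):
    # Clean one raw log line (toy format: 'userid|status|date') into a
    # structured JSON record; normalize whitespace and case along the way.
    userid, status, ds = raw_line.strip().split("|")
    return json.dumps({"userid": int(userid),
                       "status": status.strip().lower(),
                       "ds": ds})
```

In a real pipeline this function would be the body of a MapReduce map task run over the raw files in HDFS.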
43. 43
ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing
44. 44
Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• Like: extract new keywords/features from the old dataset
45. 45
Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be way cheaper than high-cost data management
solutions
• Move this data to Hadoop
• When needed, quickly analyze it there or move it back to the DW
49. 49
Analytical Sandbox
• What are we looking for in this data?
• No structure - hard to know
• Run ad-hoc Hive queries to see what's there
50. 50
Conclusions
• Hadoop is becoming more and more popular
• Many companies plan to adopt it
• Best used alongside existing DW solutions
• as an ETL
• as Active Storage
• as Analytical Sandbox
51. 51
References
1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20.
[pdf]
2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for
data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010.
[pdf]
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and
Teradata)
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB
Endowment 2.2 (2009): 1626-1629. [pdf]
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the
VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
52. 52
References
7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of
the ACM 51.1 (2008): 107-113. [pdf]
11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
12. Apache Hadoop project home page, url: [link].
13. Apache HBase home page, [link].
14. Apache Mahout home page, [link].
15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012). [pdf]
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical
workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]