This document discusses big data and paradigm shifts. It provides examples of how internet companies like Pinterest, Instagram, and Tumblr have leveraged big data technologies to scale rapidly while maintaining small employee teams. The document also defines big data using the four V's of volume, velocity, variety and variability. Examples are given of how companies have used big data analytics to improve customer experiences and increase business metrics like booking conversions. Technologies discussed include Hadoop, NoSQL databases, and data warehousing appliances.
SystemT: Declarative Information Extraction, by Laura Chiticariu
Invited talk at MIT CSAIL, March 8, 2016
Information extraction (IE), the task of extracting structured information from unstructured or semi-structured data, is increasingly important to a wide array of enterprise applications, ranging from Business Intelligence to Data-as-a-Service. Such applications drive the following main requirements for IE systems: accuracy, scalability, expressivity, transparency, and customizability.
SystemT, a declarative IE system, has been designed and developed to address these requirements. It is based on the basic principle underlying relational database technology: complete separation of specification from execution. SystemT uses a declarative language for expressing NLP algorithms called AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules. It makes IE orders of magnitude more scalable and easy to use, maintain and customize.
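To make the declarative flavor concrete, here is a sketch of what an AQL extraction rule looks like. The rule below is an illustrative example written from AQL's documented create view/extract syntax, not a rule from the talk:

```
-- Hypothetical rule: extract US-style phone numbers from each document.
create view PhoneNumber as
  extract regex /\d{3}-\d{3}-\d{4}/
    on D.text as number
  from Document D;
```

Because rules are declarative views rather than imperative code, the SystemT optimizer is free to choose the execution plan, much as a relational optimizer does for SQL.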
SystemT ships today with multiple products across 4 IBM Software Brands. Furthermore, SystemT is used in multiple ongoing research projects and is being taught in universities. Our ongoing research and development efforts focus on making SystemT more usable for both technical and business users, and on continuing to enhance its core functionalities based on natural language processing, machine learning, and database technology.
Deloitte's report and point of view on IBM's Watson. IBM Watson, AI, and Cognitive Computing are rapidly evolving technologies that can support and enhance enterprise solutions. Learn about IBM Watson: the why and the how.
Under the grid computing paradigm, large sets of heterogeneous resources can be aggregated and shared. Grid development and acceptance hinge on proving that grids reliably support real applications, and on creating adequate benchmarks to quantify this support. However, applications of grids (and clouds) are just beginning to emerge, and traditional benchmarks have yet to prove representative in grid environments. To address this chicken-and-egg problem, we propose a middle-way approach: create and run synthetic grid workloads composed of applications representative of today's grids (and clouds). For this purpose, we have designed and implemented GrenchMark, a framework for synthetic workload generation and submission. The framework greatly facilitates synthetic workload modeling, comes with over 35 synthetic and real applications, and is extensible and flexible. We show how the framework can be used for grid system analysis, functionality testing in grid environments, and for comparing different grid settings, and present the results obtained with GrenchMark in our multi-cluster grid, the DAS.
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat..., by Alexandru Iosup
Data are pouring in, and defining and providing data-processing services at massive scale, in short, Big Data services, could significantly improve the revenue of Europe's Small and Medium Enterprises (SMEs). A paradigm shift is about to occur, one in which data processing becomes a basic life utility, for both SMEs and the European people. Although the burgeoning datacenter industry, of which the Netherlands is a top player in Europe, is promising to enable Big Data services, the architectures and even infrastructure for these services are still lagging behind in performance, efficiency, and sophistication, and are built as monoliths reminding us of traditional data silos. Can we remove the performance and efficiency limitations of the current Big Data ecosystems, that is, of the complex stacks of middleware that are currently in use, for Big Data services? In this talk, I will present several use cases (workloads) of Big Data services for time-stamped [2,3] and graph data [4], evaluate or benchmark the performance of several Big Data stacks [3,4] for these use cases, and present a path (and promising early results) to providing a generic, data-agnostic, non-monolithic Big Data architecture that can efficiently and elastically use datacenter resources via cloud computing interfaces [1,5].
[1] A. L. Varbanescu and A. Iosup. On Many-Task Big Data Processing: from GPUs to Clouds. Proc. of SC|12 (MTAGS). http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
[2] de Ruiter and Iosup. A workload model for MapReduce. MSc thesis at TU Delft. Jun 2012. Available online via TU Delft Library, http://library.tudelft.nl
[3] Hegeman, Ghit, Capotã, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE Big Data 2013. http://www.pds.ewi.tudelft.nl/~iosup/btworld-mapreduce-workflow13ieeebigdata.pdf
[4] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis. IEEE IPDPS 2014. http://www.pds.ewi.tudelft.nl/~iosup/perf-eval-graph-proc14ipdps.pdf
[5] B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema. Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters. ACM SIGMETRICS 2014. http://pds.twi.tudelft.nl/~iosup/dynamic-mapreduce14sigmetrics.pdf
Building Resiliency and Agility with Data Virtualization for the New Normal, by Denodo
Watch: https://bit.ly/327z8UM
While the impact of COVID-19 is uniform across organisations in the region, much of how an organisation recovers from the impact and thrives in the market depends on its resiliency and business agility. An organisation's data management strategy holds the key as it tackles the challenges of siloed data sources, optimising for operational stability, and ensuring real-time delivery of consistent and reliable information, irrespective of the data source or format.
Join this session to hear why large organisations are implementing Data Virtualization, a modern data integration approach, in their data architecture to build resiliency, enhance business agility, and save costs.
In this session, you will learn:
- How to deliver clear strategy for agile data delivery across the enterprise without pains of traditional data integration
- How to provide a robust yet simple architecture for data governance, master data, data trust, data privacy, and data access security implementation, all from a single unified framework
- How to deploy digital transformation initiatives for Agile BI, Big Data, Enterprise Data Services & Data Governance
This presentation describes SGI's new offerings for Big Data in the enterprise:
* New SGI InfiniteData Cluster
* New Hadoop public Sandbox
* New SGI ObjectStore
* SGI InfiniteStorage Gateway
Learn more: http://sgi.com
Watch the video presentation: http://wp.me/p3RLEV-1jV
Applications need data, but the legacy approach of n-tiered application architecture doesn’t solve for today’s challenges. Developers aren’t empowered to build and iterate their code quickly without lengthy review processes from other teams. New data sources cannot be quickly adopted into application development cycles, and developers are not able to control their own requirements when it comes to data platforms.
Part of the challenge here is the existing relationship between two groups: developers and DBAs. Developers are trying to go faster, automating build/test/release cycles with CI/CD, and thrive on the autonomy provided by microservices architectures. DBAs are stewards of data protection, governance, and security. Both of these groups are critically important to running data platforms, but many organizations deal with high friction between these teams. As a result, applications get to market more slowly, and it takes longer for customers to see value.
What if we changed the orientation between developers and DBAs? What if developers consumed data products from data teams? In this session, Pivotal’s Dormain Drewitz and Solstice’s Mike Koleno will speak about:
- Product mindset and how balanced teams can reduce internal friction
- Creating data as a product to align with cloud-native application architectures, like microservices and serverless
- Getting started bringing lean principles into your data organization
- Balancing data usability with data protection, governance, and security
Presenter : Dormain Drewitz, Pivotal & Mike Koleno, Solstice
New Opportunities for Connected Data - Emil Eifrem @ GraphConnect Boston + Ch..., by Neo4j
Today’s complex data is not only big, but also semi-structured and densely connected. In this session we’ll look at how size, structure and connectedness have converged to transform the data landscape. We’ll then go on to look at some of the new opportunities for creating end-user value that have emerged in a world of connected data, illustrated with practical examples drawn from the telecommunications, social media and logistics sectors.
Because every organization produces and propagates data as part of their day-to-day operations, data trends are becoming more and more important in the mainstream business world’s consciousness. For many organizations in various industries, though, comprehension of this development begins and ends with buzzwords: “big data,” “NoSQL,” “data scientist,” and so on. Few realize that any and all solutions to their business problems, regardless of platform or relevant technology, rely to a critical extent on the data model supporting them. As such, Data Modeling is not an optional task for an organization’s data effort, but rather a vital activity that facilitates the solutions driving your business. Since quality engineering/architecture work products do not happen accidentally, the more your organization depends on automation, the more important the data models driving the engineering and architecture activities of your organization become. This webinar illustrates Data Modeling as a key activity upon which so much technology depends.
Solution Centric Architectural Presentation - Implementing a Logical Data War..., by Denodo
Watch full webinar here: https://bit.ly/3H5AYZf
Implementing a logical data fabric as an architecture makes absolute sense when you have data spread across various sources in the cloud, including data warehouses, data lakes and even realtime data. In this session our customer will discuss the ways in which they implemented Denodo as a logical data fabric and how it helped them reduce risk and speed up time to access data.
In this webinar, we talk with experts from Integration Developer News about the SnapLogic Elastic Integration Platform and adoption trends for iPaaS in the enterprise.
During the discussion, we address cloud application adoption challenges and 5 signs you need better cloud integration, including struggles with the "Integrator's Dilemma" and segregated integration.
To learn more, visit: www.snaplogic.com/connect-faster
Data architecture is foundational to an information-based operational environment. It is your data architecture that organizes your data assets so they can be leveraged in your business strategy to create real business value. Even though this is important, not all data architectures are used effectively. This webinar describes the use of data architecture as a basic analysis method. Various uses of data architecture to inform, clarify, understand, and resolve aspects of a variety of business problems will be demonstrated. As opposed to showing how to architect data, your presenter Dr. Peter Aiken will show how to use data architecting to solve business problems. The goal is for you to be able to envision a number of uses for data architectures that will raise the perceived utility of this analysis method in the eyes of the business.
Find more Data-Ed webinars here: www.datablueprint.com
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality, by Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Accelerate your Kubernetes clusters with Varnish Caching, by Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024, by Tobias Schneck
As AI technology pushes into IT, I asked myself, as an "infrastructure container Kubernetes guy": how does this fancy AI technology get managed from an infrastructure operations point of view? Is it possible to apply our beloved cloud native principles to it as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply them to our own infrastructure and get them to work from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo..., by James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Connector Corner: Automate dynamic content and events by pushing a button, by DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Securing your Kubernetes cluster: a step-by-step guide to success!, by KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova..., by Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
How world-class product teams are winning in the AI era by CEO and Founder, P...
Big data 2012 v1
1. Paradigm Shifts: Big Data
Pini Cohen, VP and Senior Analyst
STKI Summit 2012
Tell me and I'll forget; show me and I may remember; involve me and I'll understand.
2. The “Magic” of internet companies
Source: http://venturebeat.com/2011/10/24/next-hot-internet-companies-not-in-us/internet-company-growth/
Pini Cohen’s work Copyright STKI@2012
Do not remove source or attribution from any slide or graph
3. Pinterest
4. Pinterest Architecture Update: 18 Million Visitors, 10x Growth, 12 Employees, 410 TB of Data
• 80 million objects stored in S3 with 410 terabytes of user data, 10x what they had in August. EC2 instances have grown by 3x. Around $39K for S3 and $30K for EC2 a month.
• Pay for what you use saves money. Most traffic happens in the afternoons and evenings, so they reduce the number of instances at night by 40%.
• 12 employees as of last December. Using the cloud, a site can grow dramatically while maintaining a very small team. Looks like 31 employees as of now.
Source: http://highscalability.com/blog/2012/5/21/pinterest-architecture-update-18-million-visitors-10x-growth.html
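The Pinterest cost figures invite a quick back-of-the-envelope check. In the Python sketch below, only the $30K/month EC2 figure and the 40% night reduction come from the slide; the 50/50 day/night split of hours is an assumption for illustration:

```python
# Back-of-the-envelope check on the EC2 figures quoted in the slide.
EC2_MONTHLY = 30_000       # USD/month, from the slide (after night scaling)
NIGHT_REDUCTION = 0.40     # instances cut by 40% at night, from the slide
NIGHT_FRACTION = 0.5       # assumed share of hours that count as "night"

def scaled_cost(always_on_monthly):
    """Monthly cost when the fleet is reduced at night."""
    day = always_on_monthly * (1 - NIGHT_FRACTION)
    night = always_on_monthly * NIGHT_FRACTION * (1 - NIGHT_REDUCTION)
    return day + night

# Invert the formula: what would an always-on fleet have cost?
always_on = EC2_MONTHLY / (1 - NIGHT_FRACTION * NIGHT_REDUCTION)
print(round(always_on))               # 37500
print(round(scaled_cost(always_on)))  # 30000
```

Under these assumptions, night scaling shaves roughly $7.5K a month off an always-on fleet, which is the "pay for what you use" point the slide is making.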
5. Instagram
• The Instagram philosophy:
• Simplicity
• Optimized for minimal operational burden
• Instrument everything
6. Scaling Instagram
• Instagram went to 30+ million users in less than two years and then rocketed to 40 million users 10 days after the launch of its Android application.
• After the release of the Android app they had 1 million new users in 12 hours.
• 2 engineers in 2010.
• 3 engineers in 2011.
• 5 engineers in 2012, 2.5 on the backend. This includes iPhone and Android development.
Source: http://highscalability.com/blog/2012/4/16/instagram-architecture-update-whats-new-with-instagram.html
7. Tumblr – Microblogging social networking platform
• 500 million page views a day
• 15B+ page views a month
• Peak rate of ~40k requests per second
• 1+ TB/day into Hadoop cluster
• Many TB/day into MySQL/HBase/Redis/Memcache
• Growing at 30% a month
• ~1000 hardware nodes in production (not cloud)
• ~20 engineers (106 employees in total)
Source: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html (STKI modifications)
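The Tumblr traffic figures are internally consistent, which a few lines of Python confirm (all inputs are from the slide; the arithmetic is mine):

```python
# Cross-check the Tumblr traffic figures quoted in the slide.
PAGE_VIEWS_PER_DAY = 500_000_000   # from the slide
PEAK_RPS = 40_000                  # peak requests/second, from the slide
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

avg_rps = PAGE_VIEWS_PER_DAY / SECONDS_PER_DAY
monthly_views = PAGE_VIEWS_PER_DAY * 30

print(round(avg_rps))                # 5787 average page views/second
print(round(PEAK_RPS / avg_rps, 1))  # 6.9 -> peak is ~7x the average
print(monthly_views)                 # 15000000000, matching "15B+ a month"
```

The roughly 7x gap between peak and average rate is why capacity is planned against the ~40k req/s peak rather than the average.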
8. Technology listing
• Hadoop Mapreduce
• NoSQL dbms (Cassandra, Mongo, HBASE)
• Sharding
• In Memory DBMS
• Memcached
• MemSQL
• Solr
• Redis
• DJANGO
• Python
• ELB - Elastic load balancing amazon
9. Paradigm shifts agenda
• Big Data:
• Big Data definition and background
• Big Data value
• Big Data technology
Source: http://www.b2binbound.com/blog/?Tag=paradigm%20shift
10. Big Data Definition – 4 V's (or more…)
• Volume – tens of TBs and more (15-20TB+)
• Velocity – the speed at which data is added (10M items per hour and more), and the speed at which the data needs to be processed
• Variety – different types of data, structured and unstructured. In many cases deals with the internet of things and social media, but also with voice, video, etc.
• Variability – able to cope with new attributes and changing data types without interrupting the analytical process (without "import-export")
• Other optional V's – validity, volatility, viscosity (resistance to flow), etc.
Source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html
11. The origins of the 3 V’s:
• 2001 research by Doug Laney from META Group (now Gartner):
12. “Big Data” theme – main current usage:
• “‘Big Data’ is just marketing jargon.” – Doug Laney, Gartner
Source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html
Source: http://winnbadisa.com/wp-content/uploads/2011/12/marketing-career-cloud.jpg
• STKI : doing something significantly different from
what you’ve done until now
13. Big Data at work:
• Orbitz Worldwide has collected 750 terabytes of unstructured data on their consumers’ behavior – detailed information from customer online visits and browsing sessions. Using Hadoop, models have been developed to improve search results and tailor the user experience based on everything from location and interest in family travel versus solo travel to the kind of device being used to explore travel options.
• The result? To date, a 7% increase in interaction rate, 37% growth in stickiness of sessions, and a net 2.6% increase in booking path engagement.
Source: http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/us_cons_techtrends2012_013112.pdf
14. DW appliances will be discussed later
• Teradata
• EMC Greenplum
• Oracle Exadata
• Microsoft Parallel Data Warehouse
Source: http://www.asugnews.com/2011/09/06/inside-saps-product-naming-strategies/
15. What is the business value of big data analytics?
• Big data is now a technology looking for a business need
• It can mean doing the same thing but better / faster
(better segmentation, more accurate analysis model)
• Or it can mean doing completely new things (telematics,
sentiment analysis, recommendation engine, matching
competition’s pricing in real time, being able to analyze
data we haven’t been able to analyze in the past)
16. Decision making – old school vs. new school (big data)
• Old School:
• Phase 1 : Analyze existing data and prepare general model
• Phase 2: Apply the general model to specific client
• This means applying the same model for many clients when they
arrive
• Issues with Old School decision making:
• Time gap between preparing and applying the model
• # of combinations might be too big for a general model (example: recommendations based on interests)
• The general model generated is biased towards “main stream”
population
• New School (Big Data):
• Phase 1: Prepare specific model for the client and apply the model
– instantly
17. Big data use cases
• Recommendation engines – match users to one another and provide recommendations based on similar users (examples: LinkedIn – people you may know; Amazon)
• Sentiment Analysis (Macro or individual user)
• Fraud detection – customer behavior, historical and transactional data combined; the same analysis, but more affordable
• Customer Churn
• Social graph analysis – influencers
• Customer experience analysis – combine data from call
center, web, social media etc.
• Improved segmentation – more data (clickstream, call
records) for more accurate analysis
• Improved customer retention
18. Technology: Elements & Concepts
• Storing data for analytics (mainly):
• HDFS – Hadoop Distributed File System
• MapReduce – programming model, mainly for analytics
• Other add-ons: Pig, Hive, JAQL (IBM)
• Storing and retrieving data – DBMS:
• NoSQL DBMS (not only SQL):
• Cassandra
• MongoDB
• CouchDB
• HBase
19. Who Uses Hadoop?
• Amazon/A9
• AOL
• Facebook
• Fox Interactive Media
• Netflix
• New York Times
• Powerset (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy
20. Who Uses Cassandra?
• Facebook
• Digg
• Despegar
• Ooyala
• Imagini
• SimpleGeo
• Rackspace
• Shazam
• SoftwareProjects
21. Big Data technologies (Hadoop etc.) vs. traditional IT
Traditional IT | Big Data
• Centralized storage | Local storage
• Brand redundant servers | Cheap HW white boxes
• Standard infrastructure and virtual servers | Is standardization needed?! (at the HW level); no server virtualization
• Well-established backup and DRP procedures | Why do I need backup? How do I tackle DRP (compute clusters that are stretched over locations)?
• Traditional vendors | Open source solutions
• Mature products and procedures | In a new patch for specific issues it sometimes says “not implemented yet”
• Traditional programming, SQL | A different kind of programming (map-reduce), no joins
Will Big Data infrastructure be part of existing infrastructure, or will it be developed as a new domain?
22. New type of scale:
• Hadoop:
• Up to 4,000 machines in a cluster
• Up to 20 PB in a cluster
• Currently, traditional IT technologies cannot handle this kind of scale.
• This scale comes with a cost!
Source: http://www.techsangam.com/wp-content/uploads/2012/01/i_love_scalability_mug.jpg
23. Brewer's (CAP) Theorem
• It is impossible for a distributed computer system to
simultaneously provide all three of the following
guarantees:
• Consistency (all nodes see the same data at the same time)
• Availability (node failures do not prevent survivors from
continuing to operate)
• Partition Tolerance (the system continues to operate in many
partitions and despite arbitrary message loss)
Source: Scalebase STKI modifications
Professor Eric A. Brewer
24. Dealing With CAP
• Drop Consistency
• Welcome to the “Eventually Consistent” term.
• In the end, everything will work out just fine – and hey, sometimes this is a good-enough solution
• When no updates occur for a long period of time, eventually all
updates will propagate through the system and all the nodes will
be consistent
• For a given accepted update and a given node, eventually either
the update reaches the node or the node is removed from service
• Known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID
Source: Scalebase
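The “eventually consistent” behavior described above can be sketched in a few lines of Python. This is a hypothetical toy model (the `Replica` class and `anti_entropy` function are invented for illustration, not any real store’s API): a write lands on one replica first, reads are briefly inconsistent, and a background gossip round brings every node to the same value.

```python
class Replica:
    """One node holding a single (value, timestamp) cell; last-writer-wins."""
    def __init__(self):
        self.value, self.ts = None, 0.0

    def write(self, value, ts):
        # Accept the write only if it is newer than what this node holds.
        if ts > self.ts:
            self.value, self.ts = value, ts

def anti_entropy(replicas):
    """One gossip round: every replica adopts the newest value seen anywhere."""
    newest = max(replicas, key=lambda r: r.ts)
    for r in replicas:
        r.write(newest.value, newest.ts)

# Three replicas; a client write initially reaches only one of them.
nodes = [Replica() for _ in range(3)]
nodes[0].write("v1", ts=1.0)
stale_reads = [n.value for n in nodes]        # ['v1', None, None] -- not yet consistent

anti_entropy(nodes)                            # updates propagate in the background
consistent_reads = [n.value for n in nodes]    # ['v1', 'v1', 'v1'] -- converged
```

Once no new writes arrive and gossip has run, all nodes agree – exactly the BASE guarantee described above.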
25. Hadoop
• Apache Hadoop is a software framework that supports
data-intensive distributed applications
• It enables applications to work with thousands of nodes
and petabytes of data.
• Hadoop was inspired by Google's MapReduce and Google
File System (GFS) papers
• Contains (basically):
• HDFS – Hadoop Distributed File System
• MapReduce programming model
26. HDFS – Hadoop Distributed File System
• Parallel
• Distributed on commodity elements
• Throughput over latency
• Reliable and self-healing
• For large scale – typical file is gigabytes to terabytes (for
one file!)
• Applications need a write-once-read-many access
model (mainly analytics)
27. HDFS motivation
• What if you needed to write a program that distributes data on commodity HW (PCs or servers)? You would need to take care of:
• Where the data is located
• How to distribute data between the nodes
• How many times you want to replicate the data
• How to insert, select and update data
• What to do if one node or more fails
• How to add a node or take out a node
• Managing and monitoring the environment
• The Hadoop Distributed File System does all of this for you!
28. HDFS: Hadoop Distributed File System
• Data nodes and Name node
• Client requests meta data about a file from namenode
• Data is served directly from datanode
[Diagram: the HDFS client asks the namenode for metadata – it sends (file name, block id) and gets back (block id, block location) from the file namespace (e.g. /user/css534/input). It then requests (block id, byte range) directly from the datanodes, which serve the block data from their Linux local file systems.]
source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
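The read path in the figure can be mimicked with a small Python sketch. Everything here (the `NAMENODE` and `DATANODES` dictionaries, the `hdfs_read` function) is a hypothetical in-memory stand-in, not the real HDFS client API; it only illustrates that the namenode serves metadata while block data comes straight from the datanodes.

```python
# Hypothetical in-memory model of the HDFS read path described above:
# the namenode holds only metadata; block data is served by datanodes.

NAMENODE = {                       # file name -> ordered list of (block id, datanode)
    "/user/css534/input": [("blk_1", "datanodeA"), ("blk_2", "datanodeB")],
}
DATANODES = {                      # datanode -> {block id -> block contents}
    "datanodeA": {"blk_1": b"Hello "},
    "datanodeB": {"blk_2": b"World"},
}

def hdfs_read(path):
    blocks = NAMENODE[path]                    # 1. ask the namenode for block locations
    data = b""
    for block_id, node in blocks:              # 2. fetch each block directly
        data += DATANODES[node][block_id]      #    from the datanode that holds it
    return data

print(hdfs_read("/user/css534/input"))   # b'Hello World'
```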
29. Datanode Blockreports
File “part-0” will be replicated twice and saved in blocks 1 and 3 (the file is big, so it has to be divided into 2 blocks)
Block 1 is on data nodes A and C
source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
30. HDFS basic limitations
• Namenode is single point of failure
• Write-once model
• Plan to support appending-writes
• A namespace with an extremely large number of files
exceeds Namenode’s capacity to maintain
• Cannot be mounted by existing OS
• Getting data in and out is tedious
• HDFS does not implement / support user quotas / access
permissions
• Data balancing schemes
• No periodic checkpoints
31. MapReduce programming model
• At its most basic: brings the program to the data
• Contains two elements:
• Map: this part of the job is performed asynchronously, in parallel, by each node
• Reduce: gathers the results from the relevant nodes
• In more detail:
• Map: returns (writes to a temp file) a list containing zero or more (k, v) pairs
• Output can have a different key from the input
• Output can have the same key
• Reduce: returns a new list of reduced output from the input
32. MapReduce motivation
• What if you needed to write a program that processes data
that’s on distributed computers?
• You would need to write a distributed program that:
• Finds where the data is located
• Works on each node and then combines the results from each node together
• Decides where (on the local node) and how (in what format) to write the intermediate results
• Detects when the jobs of all participating nodes have concluded and then starts the “aggregation” part
• Decides what to do if a job is stuck (restart the job or turn to another node to perform the same job)
• Hadoop MapReduce is the framework for you!
33. MapReduce example:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
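The pseudocode above can be exercised as ordinary Python. This is a single-process sketch of the two phases – `map_phase` and `reduce_phase` are illustrative names, and there is no actual distribution – run on the same input used in the dataflow slides:

```python
from collections import defaultdict

def map_phase(documents):
    """map(): emit an intermediate (word, 1) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """reduce(): sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
print(reduce_phase(map_phase(docs)))
# {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```

In a real Hadoop job the map calls run on different nodes and the framework shuffles the intermediate pairs to the reducers; the logic per phase is the same.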
34. Dataflow in Hadoop
[Diagram: the client submits a “Word Count” job to the master, which schedules map and reduce tasks across the worker nodes. All elements run on standard HW.]
Source: Haifa Labs IBM
35. Dataflow in Hadoop
[Diagram: the input file is stored in HDFS as two blocks – “Hello World Bye World” and “Hello Hadoop Goodbye Hadoop”. Each map task reads one block and emits intermediate counts (Hello 1, World 2, Bye 1; Hello 1, Hadoop 2, Goodbye 1), which flow to the reduce tasks.]
Source: Haifa Labs IBM
36. Dataflow in Hadoop
[Diagram: each map task writes its intermediate results to the local file system and reports “finished” plus the output location to the master, which passes the locations on to the reduce tasks.]
Source: Haifa Labs IBM
37. Dataflow in Hadoop
[Diagram: the reduce tasks fetch the intermediate map output from the mappers’ local file systems over HTTP GET.]
Source: Haifa Labs IBM
38. Dataflow in Hadoop
[Diagram: the reduce tasks write the final answer to HDFS: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2.]
Source: Haifa Labs IBM
39. Components of a cluster node (top to bottom):
• Flow file input processor
• Flow analysis map/reduce programs (flow-tools)
• MapReduce library
• Hadoop (HDFS, MapReduce)
• Java Virtual Machine
• Operating System (Linux)
• Hardware (CPU, HDD, Memory, NIC)
Source: www.caida.org/workshops/.../wide-casfi1004_wkang.ppt
40. Hive: MapReduce helper
• Code examples:
• hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
• hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
• hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
• hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
• hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
• hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
• hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
41. NoSQL DBMS: storing and retrieving data
• Key/Value
• A big hash table
• Examples: Voldemort, Amazon’s Dynamo
• Big Table
• Big table, column families
• Examples: HBase, Cassandra
• Document based
• Collections of collections
• Examples: CouchDB, MongoDB
• Graph databases
• Based on graph theory
• Examples: Neo4J
• Each solves a different problem
Source: Scalebase
42. Pros/Cons
• Pros:
• Performance
• BigData
• Most solutions are open source
• Data is replicated to nodes and is therefore fault-tolerant
(partitioning)
• Don't require a schema
• Can scale up and down
• Cons:
• Code change
• No framework support
• Not ACID
• Ecosystem (BI, backup)
• There is always a database at the backend
• Some APIs are just too simple
Source: Scalebase
43. Apache Cassandra
• Cassandra is a highly scalable, eventually
consistent, distributed, structured key-value
store
• Child of Google’s BigTable and Amazon’s
Dynamo
• Peer-to-peer architecture. All nodes are equal
Source: ids.snu.ac.kr/w/images/1/18/2011SS-03.ppt
• Cassandra’s replication factor (RF) is the total
number of nodes onto which the data will be
placed. RF of at least 2 is highly recommended,
keeping in mind that your effective number of
nodes is (N total nodes / RF).
• CQL (Cassandra Query Language) command line
• Time stamp for each value written
44. Consistent Hashing
• Partitioning uses consistent hashing (to pick the first node where data is placed), based on MD5 – a distributed hash table algorithm
• Keys hash to a point on a fixed circular space
• The ring is partitioned into a set of ordered slots; servers and keys are hashed over these slots
• Nodes take positions on the circle
• Example: nodes A, B and D exist. B is responsible for the AB range (for replication factor = 2, the default), D for the BD range, and A for the DA range
• When C joins, B and D split their ranges, and C takes over the BC range from D
Source: http://www.intertech.com/resource/usergroup/NoSQL.ppt
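The ring behavior described above can be sketched in Python using MD5, as on the slide. The `Ring` class and its single-point-per-node layout are a simplification invented for illustration (real Cassandra assigns tokens and replicas differently); the point is that when C joins, only keys in the BC range change owner.

```python
import bisect
import hashlib

def md5_point(key):
    """Hash a key to a point on the ring, as the slide does with MD5."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node takes one position on the circle, ordered by hash point.
        self.points = sorted((md5_point(n), n) for n in nodes)

    def owner(self, key):
        """The first node clockwise from the key's point owns the key."""
        i = bisect.bisect(self.points, (md5_point(key),))
        return self.points[i % len(self.points)][1]   # wrap around the circle

keys = ["row1", "row2", "row3", "row4"]
ring = Ring(["A", "B", "D"])
before = {k: ring.owner(k) for k in keys}

ring = Ring(["A", "B", "C", "D"])    # C joins the ring
after = {k: ring.owner(k) for k in keys}
# Only keys that fall in C's new range move; every other key keeps its owner.
```

Adding a node therefore reshuffles only one arc of the circle instead of rehashing every key – the property that lets Cassandra grow the cluster incrementally.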
45. Cassandra’s tunable consistency (write)
Level – Behavior:
• ANY – Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
• ONE – Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
• TWO – Ensure that the write has been written to at least 2 replicas before responding to the client.
• THREE – Ensure that the write has been written to at least 3 replicas before responding to the client.
• QUORUM – Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
• LOCAL_QUORUM – Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
• EACH_QUORUM – Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
• ALL – Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
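The quorum arithmetic in the table (N / 2 + 1, with integer division) is easy to check in code; this tiny helper is illustrative only:

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a QUORUM write: N / 2 + 1 (integer division)."""
    return replication_factor // 2 + 1

# With RF=3, writing at QUORUM and reading at QUORUM overlap in at least one
# replica (2 + 2 > 3), which is what makes reads see the latest write.
for rf in (1, 2, 3, 5):
    print(rf, quorum(rf))   # 1->1, 2->2, 3->2, 5->3
```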
Source: wiki
46. Cassandra’s data model structure
• Think of Cassandra as row-oriented
• A keyspace (settings, e.g. partitioner) contains column families (settings, e.g. comparator, type [Std]); each column is a (name, value, clock) triple
Source: http://assets.en.oreilly.com/1/event/51/Scaling%20Web%20Applications%20with%20Cassandra%20Presentation.ppt
47. Data Model – “flexible” schema!
ColumnFamily: Rockets
• Key 1: name = Rocket-Powered Roller Skates, toon = Ready, Set, Zoom, inventoryQty = 5, brakes = false
• Key 2: name = Little Giant Do-It-Yourself Rocket-Sled Kit, toon = Beep Prepared, inventoryQty = 4, brakes = false
• Key 3: name = Acme Jet Propelled Unicycle, toon = Hot Rod and Reel, inventoryQty = 1, wheels = 1
Source: http://wenku.baidu.com/view/6e254321482fb4daa58d4b87.html
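The “flexible schema” above maps naturally onto nested dictionaries; this hypothetical Python sketch shows that rows in one column family need not share the same columns:

```python
# Hypothetical sketch: each row in the "Rockets" column family is a key plus
# its own set of columns -- rows are not required to share the same columns.
rockets = {
    1: {"name": "Rocket-Powered Roller Skates",
        "toon": "Ready, Set, Zoom", "inventoryQty": 5, "brakes": False},
    2: {"name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
        "toon": "Beep Prepared", "inventoryQty": 4, "brakes": False},
    3: {"name": "Acme Jet Propelled Unicycle",
        "toon": "Hot Rod and Reel", "inventoryQty": 1, "wheels": 1},  # no 'brakes'
}

# Rows 1 and 2 carry a 'brakes' column; row 3 carries 'wheels' instead.
print(sorted(rockets[1]) == sorted(rockets[3]))  # False
```

Adding a new attribute to one row needs no schema migration – exactly the “without import-export” variability described in the 4 V’s slide.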
48. Cassandra’s CQL – Cassandra Query Language
• SQL-like. Examples:
• CREATE KEYSPACE test with strategy_class = 'SimpleStrategy' and strategy_options:replication_factor=1;
• CREATE INDEX ON users (birth_date);
• SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
• However:
• No joins
• No UPDATEs/DELETEs
49. NoSQL benchmark – for scale!
Source: research.yahoo.com/files/ycsb-v4.pdf
50. Can we live with NoSQL limitations?
• Facebook has dropped Cassandra
• “..we found Cassandra's eventual consistency model to be a
difficult pattern to reconcile for our new Messages
infrastructure”
• Facebook has selected HBase (a columnar DBMS).
http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
51. What about other NoSQL DBMS?
• MongoDB
• HBase
• CouchDB
• Maybe next session….
52. Big Data potential implications on IT
• Will traditional RDBMS be obsolete? Surely not!
• Several areas are Big Data territory by definition – internet marketing, cyber, DW, etc.
• How well can we live with “eventually consistent”, which in most cases means a 1-2 minute delay?!
• Can we decide that all batch data can live well on Big Data technologies?
• Will we see, in the end (10 years from now), that only a small portion of data still resides on RDBMS and most of the data resides on Big Data technologies?!
53. Big data challenges
• NLP in Hebrew (entity recognition is more difficult)
• Adapting analytical algorithms to match the big data world (anomaly detection needs to be redefined)
• Some problems with consistency
• Skills problem – BI staff need to program in Java and need Hadoop and NoSQL knowledge
54. Example of big data technology: SPLUNK
• Splunk is a traditional IT vendor whose product has been based on MapReduce (since 2009)
55. Thanks for your patience – hope you enjoyed it
Here you can find the latest version of this presentation http://www.slideshare.net/pini