How to get started in Big Data without Big Costs - StampedeCon 2016 (StampedeCon)
Looking to implement Hadoop but haven’t pulled the trigger yet? You are not alone. Many companies have heard the hype about how Hadoop can solve the challenges presented by big data, but few have actually implemented it. What’s preventing them from taking the plunge? Can it be done in small steps to ensure project success?
This session will discuss some of the items to consider when getting started with Hadoop and how to go about deciding to move to the de facto big data platform. Starting small can be a good approach while your company is learning the basics and deciding what direction to take. There is no need to invest large amounts of time and money up front if a proof of concept is all you aim to provide. Using well-known data sets on virtual machines is a low-cost, low-effort way to learn whether your big data journey will be successful with Hadoop.
Incorporating the Data Lake into Your Analytic Architecture (Caserta)
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop (Caserta)
In our most recent Big Data Warehousing Meetup, we learned about the transition from Big Data 1.0, with Hadoop 1.x and its nascent technologies, to the advent of Hadoop 2.x with YARN, which enables distributed ETL, SQL and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization and analytics with end-user access.
For more information on our services or upcoming events, please visit our website at http://www.casertaconcepts.com/.
Having trouble distinguishing Big Data, Hadoop & NoSQL, or finding the connections among them? This slide deck from the Savvycom team can definitely help you.
Enjoy reading!
Creating a Data Science Team from an Architect's perspective. This is about team building: how to support a data science team with the right staff, including data engineers and DevOps.
Introduction to Kudu - StampedeCon 2016 (StampedeCon)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
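The abstract's closing claim — fast scans and fast random access from a single API — is easiest to see in code. Below is a minimal sketch using the Kudu Java client, assuming a pre-existing table named metrics with host, ts and value columns; the master address and schema are hypothetical, not details from the talk.

```java
import java.util.Arrays;

import org.apache.kudu.client.*;

public class KuduQuickstart {
    public static void main(String[] args) throws KuduException {
        // Assumed master address and table name.
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            KuduTable table = client.openTable("metrics");

            // Fast random write: insert a single row.
            KuduSession session = client.newSession();
            Insert insert = table.newInsert();
            insert.getRow().addString("host", "web-01");
            insert.getRow().addLong("ts", System.currentTimeMillis());
            insert.getRow().addDouble("value", 0.42);
            session.apply(insert);
            session.close(); // flushes any pending operations

            // Fast scan: project two columns and stream the rows back.
            KuduScanner scanner = client.newScannerBuilder(table)
                .setProjectedColumnNames(Arrays.asList("host", "value"))
                .build();
            while (scanner.hasMoreRows()) {
                for (RowResult row : scanner.nextRows()) {
                    System.out.println(row.getString("host") + " " + row.getDouble("value"));
                }
            }
        }
    }
}
```

The same client handles both the row-at-a-time insert and the column-projected scan, which is exactly the gap between HBase-style and Parquet-style storage the talk describes.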
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro... (BigDataEverywhere)
Mohammad Quraishi, Senior IT Principal, Cigna
Like Moses seeing the Promised Land from afar, we knew the big data journey would be worth it, but we didn't know how hard it would be. In this talk, I'll delve into the details of our big data and analytics initiative at Cigna.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... (StampedeCon)
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and the creation and maintenance of Hive databases and tables become much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
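As a rough sketch of the Kafka-plus-Avro pattern this abstract describes (not Shutterstock's actual code; the record shape, topic name, and broker address are assumptions), producing a schema-backed event might look like this in Java:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroToKafka {
    public static void main(String[] args) throws IOException {
        // A versioned record schema; in a real platform this would live in
        // a central schema registry so producers and consumers agree on it.
        Schema schema = SchemaBuilder.record("PageView").namespace("com.example")
            .fields()
            .requiredString("userId")
            .requiredLong("timestamp")
            .optionalString("referrer") // optional fields keep evolution backwards compatible
            .endRecord();

        GenericRecord event = new GenericData.Record(schema);
        event.put("userId", "u-123");
        event.put("timestamp", System.currentTimeMillis());

        // Serialize the record to Avro binary.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
        encoder.flush();

        // Ship the bytes through Kafka, the central data hub.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "u-123", out.toByteArray()));
        }
    }
}
```

A central registry would hold the PageView schema and its versions, so consumers can deserialize any historical message and schema evolution stays backwards compatible, as the abstract describes.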
Moving to a data-centric architecture: Toronto Data Unconference 2015 (Adam Muise)
Why use a datalake? Why use lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013 (Jen Stirrup)
This session focused on data visualisation using Power BI, based on big data. Some examples of Hive and HDFS file storage are given. An overview of Microsoft HDInsight is supplied.
Getting Started with Big Data in the Cloud (RightScale)
Find out what others are doing with big data on the cloud and how to get started. We will cover solutions for Hadoop and NoSQL highlighting RightScale partner technologies IBM BigInsights, Couchbase, and MongoDB.
Why data warehouses cannot support hot analytics (Imply)
Check out the full webinar: https://imply.io/videos/why-data-warehouses-cannot-support-hot-analytics
Today’s data warehouses - whether traditional, specialized or cloud-based - are good at supporting cold analytics, such as reporting, where query times can take minutes. But they cannot cost-effectively support hot analytics—interactive ad hoc analytics usually performed by larger groups of users against batch or streaming data. Examples of hot analytics include clickstream analytics; service, network and application performance monitoring; and risk analytics.
Data warehouses struggle with hot analytics use cases because they are too slow, unable to scale, or too expensive. Learn how a new class of real-time data platforms overcome these limitations, and how companies implement a “temperature-based” approach to analytics.
Analyzing big data is a challenge, requiring lots of processing power and storage.
Cloud computing is an ideal platform to tackle this problem. HDInsight on Microsoft Azure deploys Hadoop and other open source big data tools to the cloud, making it easier to take advantage of the platform's high scalability.
In this session, you will learn what tools are available in HDInsight and how to use them to store, process, and analyze large amounts of data.
Introducing Big Data concepts & Hadoop to those who wish to begin their journey into the future of Information Technology. It is certain that data is going to play a major role in the days to come, from our daily lives to the biggest ventures we might undertake. Hence, knowing about Big Data and the technologies to work with it is going to be essential for IT professionals.
We will start with some simple presentations and then build upon them. For a more intense, focused introduction & training on Big Data and related technologies, visit our website or write to us.
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...) (Mark Rittman)
As presented at OGh SQL Celebration Day 2016 - including new content on why NoSQL and Hadoop are a better solution for social network analysis than the Oracle Database (for now...)
Here I talk about examples and use cases for Big Data & Big Data analytics, and how we accomplished massive-scale sentiment, campaign and marketing analytics for Razorfish using a collection of database, Big Data and analytics technologies.
Hadoop Basics - Apache Hadoop Big Data training by Design Pathshala (Design Pathshala)
Learn Hadoop and Big Data analytics: join Design Pathshala training programs on Big Data and analytics.
This slide covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
The right architecture is key for any IT project, and especially for big data projects, where there are no standard architectures that have proven their suitability over years. This session discusses the different big data architectures that have evolved over time, including the traditional Big Data Architecture, the Streaming Analytics architecture, and the Lambda and Kappa architectures, and presents a mapping of components from both the open source and the Oracle stacks onto these architectures.
10 concepts the enterprise decision maker needs to understand about Hadoop (Donald Miner)
Way too many enterprise decision makers have clouded and uninformed views of how Hadoop works and what it does. Donald Miner offers high-level observations about Hadoop technologies and explains how Hadoop can shift the paradigms inside of an organization, based on his report Hadoop: What You Need To Know—Hadoop Basics for the Enterprise Decision Maker, forthcoming from O’Reilly Media.
After a basic introduction to Hadoop and the Hadoop ecosystem, Donald outlines 10 basic concepts you need to understand to master Hadoop:
Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents (a short schema-on-read sketch follows this list)
Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
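To make the schema-on-read idea concrete, here is a small illustrative sketch (ours, not from Donald's report): the raw file is loaded as-is, and the structure lives only in the reading code, applied at query time. The file name and field layout are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SchemaOnRead {
    // The "schema" exists only here, in the reading code; the file on disk
    // is just raw text that was loaded first, with no questions asked.
    record PageView(String userId, long timestamp, String url) {
        static PageView parse(String line) {
            String[] f = line.split(",", 3);
            return new PageView(f[0], Long.parseLong(f[1]), f[2]);
        }
    }

    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("raw_events.csv"))) {
            long hits = lines.map(PageView::parse)
                             .filter(v -> v.url().startsWith("/search"))
                             .count();
            System.out.println("search hits: " + hits);
        }
    }
}
```

The flip side, as the list notes, is that nothing stops malformed lines from landing in the file; errors surface at read time rather than at load time.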
First presentation for Savi's sponsorship of the Washington DC Spark Interactive. Discusses tips and lessons learned using Spark Streaming (24x7) to ingest and analyze Industrial Internet of Things (IIoT) data as part of a Lambda Architecture.
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea... (Spark Summit)
Everybody agrees that IoT is changing the world… and creating new challenges for software developers, architects and DevOps. How can we build efficient and highly scalable distributed applications using open source technologies? What are the characteristics of data generated by IoT devices, and how does it differ from traditional enterprise or Big Data problems? Which architectural patterns are beneficial for IoT use cases, and why do some trusted methods eventually turn out to be “anti-patterns”? This talk will show how to combine best-of-breed open source technologies, like Apache Spark, Riak and Mesos, to build scalable IoT pipelines that ingest, store and analyze huge amounts of data while keeping operational complexity and costs under control. We will discuss the pros and cons of using relational, NoSQL and object storage products for storing and archiving IoT data, then cover best practices for using Spark with the Riak NoSQL database. We will describe how Apache Spark’s advanced modules (Spark SQL, Spark Streaming and MLlib) can solve problems common to IoT apps while using Riak for fast and scalable persistence. At the end, we will explain why Structured Spark Streaming is a godsend for IoT data and make a case for time series databases deserving a separate category in the NoSQL classification.
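Setting the Riak-specific pieces aside, a minimal Java sketch of the Structured Streaming pattern the talk praises for IoT data could look like the following; the broker address, topic, and JSON field names are assumptions, not details from the talk, and the Kafka source requires the spark-sql-kafka package on the classpath.

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class IotStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("iot-sketch").master("local[*]").getOrCreate();

        // Ingest raw device readings from a (hypothetical) Kafka topic.
        Dataset<Row> raw = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")
            .option("subscribe", "iot-events")
            .load();

        // Impose structure on the JSON payload and aggregate per device
        // over one-minute event-time windows.
        Dataset<Row> perDevice = raw
            .selectExpr("CAST(value AS STRING) AS json", "timestamp")
            .select(get_json_object(col("json"), "$.deviceId").alias("deviceId"),
                    get_json_object(col("json"), "$.reading")
                        .cast(DataTypes.DoubleType).alias("reading"),
                    col("timestamp"))
            .withWatermark("timestamp", "5 minutes")
            .groupBy(window(col("timestamp"), "1 minute"), col("deviceId"))
            .agg(avg("reading").alias("avgReading"));

        perDevice.writeStream()
            .outputMode("update")
            .format("console")
            .start()
            .awaitTermination();
    }
}
```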
Independent of the source of the data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Depending on the size and quantity of such events, this can quickly reach the range of Big Data. How can we efficiently collect and transmit these events? How can we make sure that we can always report over historical events? How can these new events be integrated into the traditional infrastructure and application landscape?
Starting with a product- and technology-neutral reference architecture, we will then present different solutions using open source frameworks and the Oracle stack, both for on premises and for the cloud.
MiNiFi is a recently started sub-project of Apache NiFi: a complementary data collection approach that supplements the core tenets of NiFi in dataflow management, focusing on collecting data at the source of its creation. Simply put, MiNiFi agents take the guiding principles of NiFi and push them to the edge in a purpose-built, design-and-deploy manner. This talk will focus on MiNiFi's features, go over recent developments and prospective plans, and give a live demo of MiNiFi.
The config.yml is available here: https://gist.github.com/JPercivall/f337b8abdc9019cab5ff06cb7f6ff09a
Strata San Jose 2017 - Ben Sharma Presentation (Zaloni)
Learn about the promise of data lakes:
- Store all types of data in its raw format
- Create refined, standardized, trusted datasets for various use cases
- Store data for longer periods of time to enable historical analysis
- Query and access the data using a variety of methods
- Manage streaming and batch data in a converged platform
- Provide shorter time-to-insight with proper data management and governance
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
Explores the notion of "Hadoop as a Data Refinery" within an organisation, be it one with an existing Business Intelligence system or none, and looks at 'agile data' as a benefit of using Hadoop as the store for historical, unstructured and very-large-scale datasets.
The final slides look at the challenge of an organisation becoming "data driven".
Hadoop as Data Refinery - Steve Loughran (JAX London)
Apache Hadoop is often described as a "Big Data Platform", but what does that mean? One way to better understand Hadoop is to talk about how Hadoop is used. This talk discusses using Hadoop as a "Data Refinery", which is a common use case. The concept is very much like a traditional oil refinery, except with data: pulling in large quantities of "crude data" over pipelines, refining some into useful business intelligence, and refining other pieces into slightly less crude data that stays in the cluster until needed later. This metaphor proves useful when considering how Hadoop could be adopted in an organisation that already has data warehousing and business intelligence systems, and when contemplating how to hook up a Hadoop cluster to the sources of data inside and outside that organisation. A key point to remember is that storing data in Hadoop is no more an end in itself than storing data in a database is: the goal is extracting information from that data. Using Hadoop as a front-end "data refinery" means that it can integrate with existing Business Intelligence systems while providing the platform for new applications.
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014) (Eric Baldeschwieler)
A summary of the History of Hadoop, some observations about the current state of Hadoop for new users and some predictions about its future (Hint, it's gonna be huge).
Presented at:
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/
Big Data brings big promise and also big challenges, the most important being the ability to deliver value to business stakeholders who are not data scientists!
Real Time Interactive Queries in Hadoop: Big Data Warehousing Meetup (Caserta)
During the Big Data Warehousing Meetup, we discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. Also, Hortonworks provided a deep-dive demo of Stinger! You can access that slideshow here: http://www.slideshare.net/CasertaConcepts/stinger-initiative-hortonworks
If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.
Better Together: The New Data Management Orchestra (Cloudera, Inc.)
Ingesting, storing, processing and leveraging big data for maximum business impact requires integrating systems, processing frameworks, and analytic deployment options. Learn how Cloudera's enterprise data hub framework, MongoDB, and the Teradata Data Warehouse, working in concert, can enable companies to explore data in new ways and solve problems that not long ago might have seemed impossible.
Gone are the days of NoSQL and SQL competing for center stage. Visionary companies are driving data subsystems to operate in harmony. So what’s changed?
In this webinar, you will hear from executives at Cloudera, Teradata and MongoDB about the following:
How to deploy the right mix of tools and technology to become a data-driven organization
Examples of three major data management systems working together
Real world examples of how business and IT are benefiting from the sum of the parts
Join industry leaders Charles Zedlewski, Chris Twogood and Kelly Stirman for this unique panel discussion, moderated by BI Research analyst, Colin White.
Hadoop-based data lakes have become increasingly popular in today's modern data architectures for their scalability, their ability to handle data variety, and their low cost. Many organizations start slow with their data lake initiatives, but as the lakes grow bigger they struggle with data consistency, quality and security, and lose confidence in the initiative.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes, the relationship of governance to productivity, and how it helps organizations meet regulatory and compliance requirements. The talk advocates a different mindset for designing and implementing flexible governance mechanisms on Hadoop data lakes.
Hadoop meets Agile! - An Agile Big Data Model (Uwe Printz)
Big Data projects are a struggle, not only on the technical side but also on the organizational side. In this talk the author shares his experience and opinions from almost 5 years of Big Data projects and develops an Agile Big Data Model which reflects his ideas on how Big Data projects can be successful, even in large companies.
Talk held at the crossover meetup of the "Agile Stammtisch Rhein-Main" and the "Hadoop & Spark User Group Rhein-Main" at codecentric AG on 31.01.2017.
Utah Big Mountain Big Data Baby Steps (4-12-2014) Final
1. Big Data, Baby Steps
“What Every Leader Should Consider When Starting a Big Data Initiative”
April 12, 2014
2. Goal for this presentation
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...”
- Dan Ariely on Facebook, Jan 6, 2013, and “others”
Why Me? Why Ancestry?
• Established consumer-facing Web company looking to leverage our data
• Started with Hadoop and HBase in 2012 on AncestryDNA
• When we started, I looked for guidance – it was missing
• Learn from us: what works, what didn’t, how to adjust
3. Agenda
• What to consider before you start
• Understand the Hadoop ecosystem
– What pieces is Ancestry using and why?
– Big Data architecture at Ancestry
• Hadoop distributions
• Big Data consultants
• How to build your team(s)
• Custom logs
– Other companies and Ancestry specifics
• Top three things to remember
4. Gartner new technology hype cycles
Where is Big Data (and Big Data Analytics) on this curve?
Source: Gartner August 2013
5. What to consider before you start
• Big Data, Business Intelligence, and Analytics are tied
– Analytics is an umbrella term that represents the entire ecosystem needed to turn data into actions
• Understand your “data”
– Web click stream data, sales transactions, advertising data, fraud detection, sensor data, social data, etc.
• Visualize your final goal and work backwards
– Imagine (prototype) the dashboards, analytics, and actions that will be available
• Deliver value to the business at each step
– “Goal of analytics is not to produce actionable insights; the goal is to produce results.” - Ken Rudin
6. Understand the Hadoop ecosystem
• Hadoop 2.0 and HDFS (YARN)
• Workflow
• NoSQL
• Data Organization
• Log collection
• Near Real-Time Stream Processing
• NFS File System on HDFS
7. What are the pieces Ancestry is using?
We use or plan to use: YARN and Ambari; forensics on log data; and visualization (graphs + Deep Zoom). [The slide pairs each category with the logo of the product used.]
8. Visualization
Company that used traditional “Cubes” and Excel:
– Business Intelligence/Data Warehouse world has moved beyond cubes
– Great product that didn’t work for us
– People went back to using Excel
– In two weeks, 30 people created 120+ dashboards and reports
– Tied to an MPP Data Warehouse, it is changing our company
– Created the “Wild, wild west” - fixing with a blessed portal
9. Hadoop distributions
(Each bullet describes a different distribution, identified on the slide by its logo.)
• Open source, active community, large ecosystem of projects; requires more internal knowledge and support
• First distribution, large “war chest” (cash investment), Impala, and the Cloudera Console
• Custom file system (API-equivalent to HDFS) that improves performance, custom HBase implementation, high-availability features
• Closest to Apache Hadoop; tested on Yahoo!’s 7,000-node cluster before being released
• Several cloud options: Google and Amazon. Quick and easy to get going. Great way to experiment and learn. Watch your data storage costs
10. Typical Big Data architecture
[Architecture diagram: user-facing stacks and services send events through a log forwarder / Kafka producer into Kafka, the stream repository; Samza stream processing, running on Hadoop, consumes streams A/B/C and maintains user properties and user segments in a Cassandra repository, driven by rules defined by marketing segmentation-and-targeting management; Hadoop acts as the system of record, taking raw data via simple ETL and holding global properties and models, with MapReduce ETL and simple ELT feeding an MPP EDW; designs, actions and feeds are exposed back to the web site.]
11. Ancestry system diagram
[System diagram: user-facing stacks (.NET, Java, JVM, Vert.x, Node.js, Python) emit events through an aspect-based Kafka producer; a Kafka log forwarder, with mirroring, feeds both the production Hadoop system of record and a Kafka stream consumed by Samza stream processing (streams A/B/C) and a notification service, part of the User 360 Services initiative; MapReduce ETL and Dogwood ELT load the ParAccel EDW; Elasticsearch and Kibana serve a Splunk-alternative initiative for operations monitoring and reporting; Tableau sits on top of production Hadoop, with aggregate ETL and an ETL-to-Kafka path carrying actions and feeds.]
12. How to organize and build your team(s)
• Hiring vs. training smart developers in your organization
– Training
▫ Self-starters who can train themselves
▫ Online training that is free or with minimal cost
▫ Paid training for specific technologies
– Promote your technology and people will reach out to you
▫ Bit of a chicken and egg problem
• Key roles for the team
– Developers who understand operations
– Hadoop engineers
– Team leaders and managers
13. Big Data consultants
• Lots of them, charging lots of money
• Not all of them are created equal
• Prefer consultants who are vendor agnostic
• Find consultants who have experience in what you want to do
• Check references
14. Companies working with custom logs
(Each bullet describes a different company, identified on the slide by its logo.)
• Scribe, Scuba, Hive, and Hadoop as the data warehouse infrastructure. Run over 10K Hive scripts daily to crunch log data. Analyst on each team to make sure logging is correct.
• Uses a very simple interface similar to log4j to log data. How to keep this accurate?
• Tried Scribe. Implemented Kafka and Avro to collect log data. Use a binary format with a schema registry.
• Recently open sourced their log collecting infrastructure (Suro – Data Pipeline).
“Used to be a web site that occasionally logged data. Now we’re a logging engine that occasionally serves as a web site.”
15. Collecting custom logs at Ancestry
• Framework piece with a “Logging Aspect”
– Logging is a cross-cutting concern
– Avoid breaking changes
– Annotations for parameter names (normalization layer)
• Defined Big Data headers that must be present in every log (User ID, Anonymous ID, Session, Request ID, Client)
– Stitch data together
– Partitioned in Hive by day/month/year
– JSON payload
– Validate messages sent vs. messages received
– Schema repository (long-term)
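The deck stops at bullet points, but a minimal sketch of such a logging aspect, assuming AspectJ and Jackson, might look like the following; the Ctx class is a hypothetical stand-in for the framework piece that would supply the required headers per request, and a real implementation would ship the JSON to Kafka rather than stdout.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class LoggingAspect {

    // Marks methods whose invocations should be logged to the data pipeline.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface BigDataLogged {
        String event();
    }

    private static final ObjectMapper JSON = new ObjectMapper();

    @Around("@annotation(logged)")
    public Object log(ProceedingJoinPoint jp, BigDataLogged logged) throws Throwable {
        Object result = jp.proceed(); // never let logging change the call itself

        Map<String, Object> msg = new LinkedHashMap<>();
        // The required big data headers, present in every message.
        msg.put("userId", Ctx.userId());
        msg.put("anonymousId", Ctx.anonymousId());
        msg.put("sessionId", Ctx.sessionId());
        msg.put("requestId", Ctx.requestId());
        msg.put("client", Ctx.client());
        msg.put("event", logged.event());
        msg.put("args", jp.getArgs()); // parameter values; annotations would normalize names

        // A real implementation would hand this JSON payload to a Kafka producer.
        System.out.println(JSON.writeValueAsString(msg));
        return result;
    }

    // Hypothetical per-request context; a real framework populates these values.
    static class Ctx {
        static String userId() { return "u-123"; }
        static String anonymousId() { return "anon-456"; }
        static String sessionId() { return "s-789"; }
        static String requestId() { return "r-42"; }
        static String client() { return "web"; }
    }
}
```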
17. Ancestry log collection details
Each server:
• 10 rolling logs
• Scraper process
Validate your data collection infrastructure:
• Auto-incrementing count in every log message
• Count on the Framework side (sender) and count on Hadoop (receiver)
[Diagram: on each server, a log sender writes to 10 rolling files on the local hard drive; a scraper process ships them to Kafka, and a log receiver loads them into Hadoop.]
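A small sketch of the auto-incrementing count idea (ours, with a hypothetical message format): the sender stamps every message with a per-server sequence number, and the receiver reports any gap between consecutive numbers as lost messages.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class SequenceCheck {
    // Sender side: one monotonically increasing counter per server process.
    static final AtomicLong SEQ = new AtomicLong();

    static String stamp(String host, String payload) {
        return host + "|" + SEQ.incrementAndGet() + "|" + payload;
    }

    // Receiver side (e.g., on Hadoop): track the last sequence seen per host
    // and report gaps, which indicate messages lost in transit.
    static final Map<String, Long> LAST_SEEN = new ConcurrentHashMap<>();

    static void receive(String message) {
        String[] parts = message.split("\\|", 3);
        String host = parts[0];
        long seq = Long.parseLong(parts[1]);
        Long prev = LAST_SEEN.put(host, seq);
        if (prev != null && seq != prev + 1) {
            System.err.printf("%s: expected %d, got %d (%d lost)%n",
                              host, prev + 1, seq, seq - prev - 1);
        }
    }

    public static void main(String[] args) {
        receive(stamp("web-01", "event A"));
        receive(stamp("web-01", "event B"));
        receive("web-01|5|event E"); // simulated gap: messages 3 and 4 lost
    }
}
```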
18. Ancestry moving forward
• Ancestry is not “done” - the journey continues
– Still evolving and changing
– My thinking and understanding has also changed
• Means we will embrace new technologies in the future
– Keep our eyes open and experiment
• This is affecting the entire organization
– Becoming more involved with Open Source and the communities that support it
19. Top three things to remember
• First and foremost, understand your needs
– No clear right or wrong way
– Keep it simple because simple scales
• This is about Analytics and impacting the business
• Find a company that fits you and follow them:
– Netflix (cloud architecture, code for survival, simian army)
– Facebook (HBase)
– LinkedIn (Kafka, Samza, Azkaban)