Creating a Data Science Team from an Architect's Perspective. This talk is about team building: how to support a data science team with the right staff, including data engineers and DevOps engineers.
Moving to a data-centric architecture: Toronto Data Unconference 2015 (Adam Muise)
Why use a data lake? Why use lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
Turn Data Into Actionable Insights - StampedeCon 2016 (StampedeCon)
At Monsanto, emerging technologies such as IoT, advanced imaging and geospatial platforms, and molecular breeding, ancestry, and genomics data sets have made us rethink how we approach developing, deploying, scaling, and distributing our software to accelerate predictive and prescriptive decisions. We created a cloud-based data science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and to integrate analytics with our core product platforms.
In this talk we will share our journey of transformation, showing how we enabled: a collaborative discovery analytics environment where data science teams perform model development; data provisioning through APIs and streams; and model deployment to production on our auto-scaling big-data compute in the cloud, supporting streaming, cognitive, predictive, prescriptive, historical, and batch analytics@scale, all integrated with our core product platforms to turn data into actionable insights.
Innovation in the Data Warehouse - StampedeCon 2016 (StampedeCon)
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we’ll dive into the decision points necessary for your own cluster, for example: cloud vs. on premises, physical vs. virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we’ll share some lessons learned about which pieces of our architecture worked well and rant about those which didn’t. No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.
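The lambda architecture mentioned above can be made concrete with a toy sketch (the event schema and user names here are invented for illustration): a batch layer recomputes accurate views over the full historical log, a speed layer counts only the events that arrived since the last batch run, and the serving layer merges both at query time.

```python
from collections import Counter

def batch_view(historical_events):
    """Batch layer: recompute counts from the full historical log (slow, accurate)."""
    return Counter(e["user"] for e in historical_events)

def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(e["user"] for e in recent_events)

def query(user, batch, speed):
    """Serving layer: merge the batch and speed views at read time."""
    return batch.get(user, 0) + speed.get(user, 0)

historical = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
recent = [{"user": "a"}]

batch = batch_view(historical)
speed = speed_view(recent)
print(query("a", batch, speed))  # 3: two historical events plus one recent
```

In a kappa architecture, by contrast, the batch layer disappears: there is a single stream-processing path, and "recomputation" means replaying the log through the same code.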
How to get started in Big Data without Big Costs - StampedeCon 2016 (StampedeCon)
Looking to implement Hadoop but haven’t pulled the trigger yet? You are not alone. Many companies have heard the hype about how Hadoop can solve the challenges presented by big data, but few have actually implemented it. What’s preventing them from taking the plunge? Can it be done in small steps to ensure project success?
This session will discuss some of the items to consider when getting started with Hadoop and how to go about making the decision to move to the de facto big data platform. Starting small can be a good approach when your company is learning the basics and deciding what direction to take. There is no need to invest large amounts of time and money up front if a proof of concept is all you aim to provide. Using well-known data sets on virtual machines can provide a low-cost, low-effort implementation to learn whether your big data journey will be successful with Hadoop.
What is Big Data Discovery, and how it complements traditional business analytics (Mark Rittman)
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization.
In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... (StampedeCon)
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
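The backward-compatibility guarantee a schema registry enforces can be illustrated with a simplified check (this is not Shutterstock's actual registry logic, just one core Avro rule): a reader using the new schema can still decode data written with the old one as long as every newly added field carries a default value.

```python
def is_backward_compatible(old_schema, new_schema):
    """Readers on new_schema must still decode data written with old_schema:
    any field present only in the new schema needs a default value."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False
    return True

# Hypothetical schema versions in Avro-like dict form.
v1 = {"name": "Event", "fields": [{"name": "id", "type": "string"}]}
v2 = {"name": "Event", "fields": [
    {"name": "id", "type": "string"},
    {"name": "source", "type": "string", "default": "unknown"},
]}
v3 = {"name": "Event", "fields": [
    {"name": "id", "type": "string"},
    {"name": "source", "type": "string"},  # no default: breaks old data
]}

print(is_backward_compatible(v1, v2))  # True
print(is_backward_compatible(v1, v3))  # False
```

Real Avro schema resolution covers more cases (type promotions, unions, renamed fields via aliases), but the added-field-needs-a-default rule is the one evolution trips over most often.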
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Apache Hadoop is quickly becoming the technology of choice for organizations investing in big data, powering their next-generation data architecture. With Hadoop serving as both a scalable data platform and computational engine, data science is re-emerging as a centerpiece of enterprise innovation, with applied data solutions such as online product recommendation, automated fraud detection and customer sentiment analysis. In this talk Ofer will provide an overview of data science and how to take advantage of Hadoop for large-scale data science projects:
- What is data science?
- How can techniques like classification, regression, clustering and outlier detection help your organization?
- What questions do you ask and which problems do you go after?
- How do you instrument and prepare your organization for applied data science with Hadoop?
- Who do you hire to solve these problems?
You will learn how to plan, design and implement a data science project with Hadoop.
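Of the techniques listed above, outlier detection has the smallest possible sketch. Below is a minimal z-score detector (the order counts are made-up sample data): flag any value more than a chosen number of standard deviations from the mean.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily order counts with one obvious spike.
daily_orders = [102, 98, 105, 99, 101, 97, 103, 100, 450]
print(zscore_outliers(daily_orders, threshold=2.0))  # [450]
```

At Hadoop scale the same statistic would be computed with a distributed engine rather than in-memory Python, but the logic of "score each point against the population" is unchanged.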
Incorporating the Data Lake into Your Analytic Architecture (Caserta)
Joe Caserta, President at Caserta Concepts, presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Big Data in the Cloud - Montreal April 2015 (Cindy Gross)
Slides:
- Basic Big Data and Hadoop terminology
- What projects fit well with Hadoop
- Why Hadoop in the cloud is so powerful
- Sample end-to-end architecture
- See: Data, Hadoop, Hive, Analytics, BI
- Do: Data, Hadoop, Hive, Analytics, BI
- How this tech solves your business problems
IDERA Live | The Ever-Growing Science of Database Migrations (IDERA Software)
You can watch the replay for this webcast in the IDERA Resource Center: http://ow.ly/QHaG50A58ZB
Many information technology professionals may not recognize it, but the bulk of their work has been, and continues to be, nothing more than database migrations: in the old days moving files across systems, then moving files into relational databases, then loading into data warehouses, and now moving to NoSQL and the cloud. In this presentation we'll delve into the ever-growing and increasingly complex world of database migrations, including what issues must be planned for and overcome, what problems are likely to occur, and what types of tools exist.
Database expert Bert Scalzo will cover these and many other database migration concerns.
About Bert: Bert Scalzo is an Oracle ACE, author, speaker, consultant, and a major contributor to many popular database tools used by millions of people worldwide. He has 30+ years of database experience and has worked for several major database vendors. He holds BS, MS and PhD degrees in computer science, plus an MBA. He has presented at numerous events and webcasts. His areas of key interest include data modeling, database benchmarking, database tuning, SQL optimization, "star schema" data warehousing, running databases on Linux or VMware, and using NVMe flash-based technology to speed up database performance.
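The migration discipline described above usually reduces to one mechanism: versioned changes applied exactly once, with the applied versions recorded in the database itself. A minimal sketch using SQLite (the table names and migrations are hypothetical):

```python
import sqlite3

# Hypothetical ordered migrations; each runs at most once.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn):
    """Apply any migration whose version is not yet recorded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    for version, sql in MIGRATIONS:
        if version not in applied:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied versions are skipped

cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name', 'email']
```

Production migration tools add ordering checks, checksums, and rollback scripts on top of this core loop, but the bookkeeping-table idea is the same.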
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016 (StampedeCon)
Hadoop adoption is a journey. Depending on the business, the process can take weeks, months, or even years. Hadoop is a transformative technology, so the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. There are challenges for companies who have lived with an application-driven business for the last two decades to suddenly become data-driven. Companies need to begin thinking less in terms of single, siloed servers and more about "the cluster".
The concept of the cluster becomes the center of data gravity, drawing all the applications to it. Companies, especially their IT organizations, embark on a process of understanding how to maintain and operationalize this environment and provide the data lake as a service to the businesses. They must empower the business by providing the resources for the use cases which drive both renovation and innovation. IT needs to adopt new technologies and new methodologies which enable the solutions. This is not technology for technology's sake. Hadoop is a data platform servicing and enabling all facets of an organization. Building out and expanding this platform is the ongoing journey as word gets out to businesses that they can have any data they want, at any time. Success is what drives the journey.
The length of the journey varies from company to company. Sometimes the challenges are based on the size of the company, but many times they are based on the difficulty of unseating established IT processes companies have adopted without forethought for the past two decades. Companies must navigate through the noise; sifting through it to find those solutions which bring real value takes time. As the platform matures and becomes mainstream, more and more companies are finding it easier to adopt Hadoop. Hundreds of companies have already taken many steps; hundreds more have already taken the first step. As the wave of successful Hadoop adoption continues, more and more companies will see the value in starting the journey and paving the way for others.
The Rise of the DataOps - Dataiku - J On the Beach 2016 (Dataiku)
Many organisations are creating groups dedicated to data. These groups have many names: Data Team, Data Lab, Analytics Team…
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and on their ability to actually deploy data science applications in production.
In that regard, a new role of "DataOps" is emerging. Similar to DevOps for (web) development, the DataOps role is a merger of data engineer and platform administrator. Well versed in cluster administration and optimisation, a DataOps would also have a perspective on data quality and the relevance of predictive models.
Do you want to be a DataOps? We'll discuss the role and its challenges during this talk.
Introduction to Streaming and Messaging: Flume, Kafka, SQS, Kinesis (Omid Vahdaty)
Does big data leave you a bit confused? Messaging? Batch processing? Data streaming? In-flight analytics? Cloud? Open source? Flume? Kafka? Flafka (both)? SQS? Kinesis? Firehose?
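The common thread behind all those systems is the producer/broker/consumer pattern. A toy stand-in for a broker like Kafka or SQS, using a plain in-process queue (the event payloads are invented), shows the streaming idea: the consumer processes events as they arrive instead of waiting for a complete batch.

```python
import queue
import threading

broker = queue.Queue()   # stand-in for a message broker topic/queue
SENTINEL = None          # signals end of stream in this toy example
processed = []

def producer():
    """Publish a handful of events, then signal completion."""
    for i in range(5):
        broker.put({"event_id": i, "payload": i * i})
    broker.put(SENTINEL)

def consumer():
    """Process each event as it arrives (streaming, not batch)."""
    while True:
        msg = broker.get()
        if msg is SENTINEL:
            break
        processed.append(msg["payload"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(processed)  # [0, 1, 4, 9, 16]
```

Real brokers add the parts this sketch omits: durable storage, partitioning for parallel consumers, delivery guarantees, and replay.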
Choosing an HDFS data storage format: Avro vs. Parquet and more - StampedeCon 2015 (StampedeCon)
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
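A toy model makes the row-versus-columnar trade-off behind numbers like that 25x concrete (the table and field names are invented): an analytical query that aggregates a single column only has to touch that column's values in a columnar layout, instead of every field of every row.

```python
# 1000 hypothetical records with three fields each.
rows = [{"id": i, "name": f"user{i}", "amount": i % 100} for i in range(1000)]

# Row-oriented layout: all fields interleaved; a scan reads everything.
row_store = [v for r in rows for v in (r["id"], r["name"], r["amount"])]

# Column-oriented layout: each column contiguous; read only what you need.
column_store = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

values_scanned_row = len(row_store)               # 3000 values touched
values_scanned_col = len(column_store["amount"])  # 1000 values touched
total = sum(column_store["amount"])

print(values_scanned_row, values_scanned_col, total)  # 3000 1000 49500
```

With wide tables (dozens of columns) the gap grows proportionally, and formats like Parquet and ORC add per-column compression and predicate pushdown on top, while row-oriented formats like Avro remain the better fit for write-heavy, whole-record workloads.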
The buzzword of the past year is “Data Science”. But what does it really mean? What does a “Data Scientist” actually do? What tools does Microsoft make available? And what other tools are there besides Microsoft's?
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will present where data science projects go wrong, how you should think of data science projects, what constitutes success in data science and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it and even whether you need it then this talk is for you!
What you will take away from this session:
- Learn how to make your data science projects successful
- Evaluate how to track progress and report on the efficacy of data science solutions
- Understand the role of engineering and data scientists
- Understand your options for processes and software
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution, a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
State of Play: Data Science on Hadoop in 2015, by Sean Owen at Big Data Spain 2014 (Big Data Spain)
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine learning is not new. Big machine learning is qualitatively different: more data beats algorithm improvement, scale trumps noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
The Right Data Warehouse: Automation Now, Business Value Thereafter (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” to collect interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for affecting business value – and how do we harness it? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
Dapper: the microORM that will change your life (Davide Mauri)
ORM or stored procedures? Code First or Database First? Ad-hoc queries? Impedance mismatch? If you're a developer, or a DBA working with developers, you have heard all these terms at least once in your life… and usually in the middle of a heated discussion, debating one or the other. Well, thanks to StackOverflow's Dapper, all these fights are over. Dapper is a blazing fast microORM that allows developers to map SQL queries to classes automatically, leaving (and encouraging) the usage of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVPs are supported too!). In this session I'll show how to use Dapper in your projects, from the very basics to some more complex usages that will help you create *really fast* applications without the burden of huge and complex ORMs. The days of impedance mismatch are finally over!
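Dapper itself is a .NET library, but its core idea translates to any language. A rough Python analog using only sqlite3 and a dataclass (the `query` helper and the `users` table are made up for illustration, not Dapper's API): keep writing plain, parameterized SQL and map each result row onto a typed object automatically.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str

def query(conn, cls, sql, params=()):
    """Run a parameterized query and map each row onto the given dataclass,
    matching columns to constructor arguments by name."""
    cur = conn.execute(sql, params)
    names = [d[0] for d in cur.description]
    return [cls(**dict(zip(names, row))) for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])

users = query(conn, User, "SELECT id, name FROM users WHERE id > ?", (0,))
print(users[0].name)  # Ada
```

The design choice is the same one Dapper makes: the SQL stays visible and hand-tuned, and the library's only job is the mechanical row-to-object mapping.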
How to build your own Delve: combining machine learning, big data and SharePoint (Joris Poelmans)
You are experiencing the benefits of machine learning every day through product recommendations on Amazon and Bol.com, credit card fraud prevention, etc. So how can we leverage machine learning together with SharePoint and Yammer? We will first look into the fundamentals of machine learning and big data solutions, and then we will explore how we can combine tools such as Windows Azure HDInsight, R, and Azure Machine Learning to extend and support collaboration and content management scenarios within your organization.
We recently presented our technology solution for metadata discovery to the Boulder Business Intelligence Brains Trust in Colorado. (www.bbbt.us)
The whole session was also videoed and there is a link to the recording at the end of the presentation.
Journey of The Connected Enterprise - Knowledge Graphs - Smart Data - Benjamin Nussbaum
We live in an era where the world is more connected than ever before and the trajectory is such that data relationships will only continue to increase with no signs of slowing down.
Connected data is the key to your business succeeding and growing in today’s connected world.
Leading enterprises will be the ones that utilize relationship-centric technologies to leverage connections from their internal operations and supply chain to their customer and user interactions. This ability to utilize connected data to understand all the nuanced relationships within their organization will propel them forward as they act on more holistic insights.
Every organization needs a knowledge graph because connected data is an essential foundation to advancing business. Knowledge graphs provide:
- Increased visibility between internal groups
- Efficiency gains
- Cross-functional data collaboration
- More complete and reliable business insights
- Better customer engagement
The live presentation and discussion can be found here: https://www.youtube.com/watch?v=RQGdw82rAes
Additional reading on why connected data is beneficial: https://www.graphgrid.com/why-connected-data-is-more-useful/
Connected data solutions available by Benjamin and his team via GraphGrid and AtomRain: https://www.graphgrid.com and https://www.atomrain.com
Business in the Driver’s Seat – An Improved Model for Integration - Inside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on September 30, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=bfff40f7c9645fc398770ea11152b148
The fueling of information systems will always require some effort, but a confluence of innovations is fundamentally changing how quickly and accurately it can be done. Gone are long cycle times for development. Today, organizations can embrace a more rapid and collaborative approach for building analytical applications and data warehouses. The key is to have business experts working hand-in-hand with data professionals as the solutions take shape, thus expediting the speed to valuable insights.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains the changing nature of information design. He’ll be briefed by WhereScape President Mark Budzinski, who will discuss his company’s data warehouse automation solutions and how they enable collaborative development. He will share use cases that illustrate how, by aligning business and IT, organizations can enable faster and more agile data warehouse development.
Visit InsideAnalysis.com for more information.
Rental Cars and Industrialized Learning to Rank with Sean Downes - Databricks
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
Synapse is a solution provider with an innovative alternative to commercial off-the-shelf IT applications, empowering business professionals to shape business processes without being chained to IT applications.
2015 nov 27_thug_paytm_rt_ingest_brief_final - Adam Muise
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system with SQOOP to a streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature creation template.
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture - Adam Muise
An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
What is Hadoop brief intro for Georgian Partners CTO Conference. This outlines the origins of Open Source Apache Hadoop and how Hortonworks fits into this picture. There is also a brief introduction to YARN, the new resource negotiation layer.
Sept 17 2013 - THUG - HBase a Technical Introduction - Adam Muise
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
1. So you want to data science.
Adam Muise
Chief Architect
2. Who am I?!
• Chief Architect at Paytm Labs!
• Paytm Labs is a data-driven lab founded to take on
the really hard problems of scaling up Fraud,
Recommendation, Rating, and Platform at Paytm!
• Paytm is an Indian Payments/Wallet company that has
50 Million wallets already, adds almost 1 Million
wallets a day, and will have more than 100 Million
customers by the end of the year. Alibaba recently
invested in us, perhaps you heard.!
• I’ve also worked with Data Science teams at IBM,
Cloudera, and Hortonworks!
7. The Leadership!
If you are creating a data science
team, chances are that you are not a
Data Scientist. Data Scientists are
best applied to the problems of data,
not management.!
8. The Leadership!
Your boss (should ask): Why do you
even need data science to solve the problem?!
You (should) answer: The problem is too
complex to solve without machine
learning. Here’s why.!
You (should not) answer: Big data and
data science is on the roadmap.!
9. The Leadership!
You have your budget for a team of 2
data scientists. That’s a good start
right? Get ready to ask for more
money. !
10. The Leadership!
You need to ask your management for:!
- Budget for 2 data engineers for every data scientist you hire!
- Access to the data lake; failing that, access to the data warehouse!
- DevOps!
- Time to gain domain expertise before producing results!
- Exec-level cooperation from those teams who own the data and
tools you need and those who understand the data you need!
- A budget for servers/tools/additional storage based on a TCO
calculation you already did (right?)!
- A dedicated place for your team to work!
11. The Leadership!
Got DataLake?!
!
No? Depending on your
problem space,
chances are you are
building one unless you
can pull what you need
from an Existing Data
Warehouse.!
12. The Leadership!
You didn’t do a TCO (Total Cost of Ownership) calculation?
Ok, here you go:!
1. Internal/External cloud instances that can run Spark/
Hadoop/etc!
2. Storage costs (S3, internal, etc) for your analytical data
sets!
3. Lead time to get started, something like 1-2 months
depending on the complexity of the problem (Fraud
might take 3 months whereas Recommendation Engines
might be 1 month)!
4. Training time and costs for tools you didn’t know you
needed!
What / How much:
- 24-32 medium to large instances on AWS each month: $15,000 to $45,000 per month
- Storage costs for S3 (400TB to 2PB): $12,000 to $57,000 per month
- Salaries & Operating Expenses: 2 x $xxxxx your operating costs, including salaries for yourself and 3 people
- Training (courses for tools and perhaps a conference trip for hiring): $5,000 to $15,000
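The monthly ranges above can be summed in a few lines. A back-of-envelope sketch, where the $60,000 salaries/opex figure is a made-up placeholder (the slide leaves it as $xxxxx) and training is amortized over twelve months:

```python
# Hypothetical monthly TCO from the ranges above; all figures illustrative.

def monthly_tco(compute, storage, salaries_opex, training_annual):
    """Sum the monthly cost buckets from the TCO table (training amortized)."""
    return compute + storage + salaries_opex + training_annual / 12

# Low and high ends of the slide's ranges, with a placeholder salaries number.
low = monthly_tco(15_000, 12_000, 60_000, 5_000)
high = monthly_tco(45_000, 57_000, 60_000, 15_000)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Even at the low end, the infrastructure alone dwarfs the training budget, which is why the ask-for-more-money warning above is worth taking seriously.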
14. The Team!
So you have permission, resources,
and a corner in an office. How do you
start? !
15. The Team!
Assemble your team in the following
order:!
1. Get a Data Engineer with a good
analytical mind. Have them beg,
borrow, or steal whatever data sets
might be applicable to the
problem. Without data, no data
sciencey stuff can happen.!
16. The Team!
Assemble your team in the
following order:!
2. While you are getting
your data, hire or recruit
an internal Data Scientist. !
Easy, right?!
17. !!!!!!WARNING!!!!!!!
Data Science is not a mystical art form handed down by monks and taught over
50 years. You just need:!
• a good math background!
• academic or job experience with machine learning !
• business context!
• the ability to code!
That can be easier to find than you think. !
!
That being said, everybody seems to think they are data scientists these days,
from the guy who writes the monthly SQL reports to your office manager who is a
whiz at Excel.!
18. The Team!
Assemble your team in the following
order:!
3. More Data Engineers. !
4. DevOps support (if you don’t have
a common resource pool to draw
from).!
19. The Team!
Keep your data science team innovative, keep
them away from bureaucracy, keep them cool.
Don’t discount the cool factor.!
They are supposed to solve hard problems, not
deal with everyday business issues. To stay
objective, they need to be decoupled from the
emergencies and the mediocre.!
If that sounds elitist then I challenge you to
create a scaling fraud detection system with your
existing data warehouse team. No really, try it. !
20. The Team!
What will they do?!
The Data Engineer !
Your data engineer is the heart and soul of your data science
team and will get almost none of the credit in the end. They
will help build your data pipeline, perform data
transformations, optimize training, automate validation, and
take the results into production. !
If you are lucky, you have Data Scientists that respect this
role and will often take some of these roles on to help ensure
their vision reaches production. Instead of relying on luck,
you can hire this way too. !
21. The Team!
What will they do?!
The Data Scientist!
Your Data Scientist will explore the data, create models, validate,
explore the data again, go in a different direction, clarify
requirements, model again, validate, retract, and then produce a
good model. The process is not deterministic and is a mix of
research and implementation. A good Data Scientist will be able to
code in the tools that you intend to use to implement
production code, something like Scala on Spark.!
Your Data Scientist will have or at least learn the business context
required to solve your problem. They will need to communicate with
business experts to validate their solutions actually solve the
problem or to help drive them in a new direction. !
22. The Team!
What will they do?!
DevOps!
Developer Operations will help
build that data pipeline for you. If
you have to build a Data Lake from
scratch, you are going to really rely
on these folks. They should be
elite, understand distributed
systems, ride a motorcycle, and be
someone you feel uncomfortable
standing next to in an elevator.!
23. Managing The Team!
If your Data Scientists are not stellar
coders, put a Data Engineer in their
grill and make them produce code.
They can’t contribute if they can’t get
their hands dirty. Data Science is not
an ivory tower. !
24. Managing The Team!
Introduce your team to the
business team that knows the
data or business processes
better than anyone else. Often
that’s not the CIO-favored DWH
team, but rather the Customer
Service Representatives*!
*This was especially true in fighting Fraud. !
25. Managing The Team!
Ways to make your team hate you:!
Data Scientists:!
• Don’t provide the data they need to create their models!
• Suggest that they create their own training data, from scratch!
• Provide ambiguous goals for the accuracy and precision of their models!
• Tell them to mine the data / don’t have a plan!
• Don’t respect the time it takes to create a model!
Data Engineers:!
• Let the Data Scientists use whatever tool they want without respect to parallel processing or
implementation!
• Have no management control over your data sources!
DevOps:!
• Use anything by IBM, Microsoft, SAS, or Oracle in your pipeline!
• Let the Data Engineers decide on the infrastructure!
27. The Work!
Start out with a goal that is clear and
unambiguous.!
“I want to detect and prevent 50% of
Fraud in my payments system”!
“I want to increase conversion rates in
my eCommerce platform by 20%”!
28. The Work!
Get as much of the raw data as soon as you can
and as fast as you can. Don’t have a Data Lake?
Get your Hadoop on ASAP. !
!
29. The Work!
Give the team time to research the
data, gain context and become
experts. !
!
30. The Work!
Data without context == a complete
lack of direction in research. !
Research needs constant checks to
ensure that the primary problem is
being solved. !
!
31. The Work!
Data Science Development !=
Engineering Software Development.!
You will have to separate your
research process from the
engineering process that delivers the
models to production. !
!
32. The Work!
Data Engineering is an ongoing
process. You will need to maintain
pipelines, adapt to schema changes,
implement data cleansing, maintain
metadata in the data lake, optimize
processing workflows, etc. You will
never outgrow the need for your Data
Engineers. !
!
34. The Architecture!
Start with the cloud. You need to get
your infrastructure up as quickly as
possible. At the beginning, this is
cheaper than you think compared to the
time and startup costs for creating an
on-premise data lake, even/especially if
you have an existing IT Team*!
!
*If you are a big corporation, your IT team is often the biggest barrier to your success in
creating an independent Data Science team.!
36. The Architecture! Lambda Architecture!
Batch Ingest:!
• SQOOP from MySQL instances!
• Keep as much in HDFS as you can, offload to S3 for
DR/Archive and when you have colder data!
• Spark and other Hadoop processing tools can run
natively over S3 data so it’s never really gone (don’t
use Glacier in a processing workflow)!
Realtime Ingest:!
• Mypipe to get events from binary log data and push
into Kafka topics (under construction)!
• VoltDB connector to get events from DB and push to
Kafka (under construction)!
• Streaming data piped through Kafka!
• All Realtime data processed with Spark Streaming or
Storm from Kafka!
37. The Architecture!
As you grow, your processing and
storage needs will likely mature.
Consider moving to an on-premise
solution for your Hadoop/Processing
architecture. You can always archive
to S3 if you need DR and don’t have
the appetite to create two clusters.!
38. The Architecture!
With an on-premise architecture, you
can interact with existing on-premise
production systems quickly. For us,
that means real-time Fraud detection
and action. You may find yourself
maintaining both in the long run.!
40. armando@paytm.com - @jabenitez
Supervised learning vs Anomaly detection
Anomaly detection:
๏ Very small number of positive examples
๏ Large number of negative examples
๏ Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
๏ Ideally large number of positive and negative examples
๏ Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set
* Anomaly Detection - Andrew Ng - Coursera ML Course
41. What approach to follow?
๏ Not so good: One model to rule them all
๏ Better:
๏ Many models competing against each other
๏ 100s or 1000s of rules running in parallel
๏ Know thy customer
42. Feature Selection
๏ Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏ Most common problem: comparable distributions for both normal and anomalous examples
๏ Possible solutions:
๏ Apply transformations and variable combinations, e.g. x_{n+1} = (x1 + x4)^2 / x3
๏ Focus on variable ratios and transaction velocity
๏ Use deep learning for feature extraction
๏ Dimensionality reduction
๏ your solution here
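The transformation bullet above can be made concrete. A minimal sketch, where the variable names and the velocity window are illustrative assumptions, not from the deck:

```python
# Derived feature from the slide: x_{n+1} = (x1 + x4)^2 / x3.
def combined_feature(x1, x3, x4):
    return (x1 + x4) ** 2 / x3

# Transaction-velocity style ratio (hypothetical definition):
# number of transactions per hour over a sliding window.
def txn_velocity(txn_count, window_hours):
    return txn_count / window_hours

print(combined_feature(2.0, 4.0, 2.0))  # (2 + 2)^2 / 4 = 4.0
print(txn_velocity(30, 24))             # 1.25 transactions/hour
```

The point of such combinations is to pull the normal and anomalous distributions apart when the raw variables overlap too much.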
45. What we have tried
๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression
Combine
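The deck does not specify how the models are combined; one simple sketch of a combination strategy is a majority vote over the individual detectors' 0/1 flags:

```python
# Each element of `flags` is one model's verdict: 1 = fraud, 0 = ok.
# The detectors themselves (density estimator, random forest, RBM, ...)
# are placeholders here.
def majority_vote(flags):
    return 1 if sum(flags) > len(flags) / 2 else 0

print(majority_vote([1, 1, 0]))  # 1: two of three models flagged the transaction
print(majority_vote([0, 1, 0]))  # 0: only one model flagged it
```

In practice a weighted vote or a meta-model trained on the individual scores is common, but the voting version shows the idea in a few lines.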
47. Anomaly Detection* - Example
๏ Choose features, x_i, that are indicative of anomalous examples
๏ Fit parameters μ_j, σ_j² of a normal distribution to each feature
๏ Given a new example x, compute p(x) = Π_j p(x_j; μ_j, σ_j²)
๏ Anomaly if p(x) < ε
* Anomaly Detection - Andrew Ng - Coursera ML Course
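A minimal version of that recipe, following the cited Coursera material; the epsilon value and the toy data are arbitrary:

```python
import math

def fit_gaussians(samples):
    """Estimate a per-feature mean and variance from normal (y=0) examples."""
    n, d = len(samples), len(samples[0])
    mu = [sum(x[j] for x in samples) / n for j in range(d)]
    var = [sum((x[j] - mu[j]) ** 2 for x in samples) / n for j in range(d)]
    return mu, var

def p(x, mu, var):
    """p(x) as a product of independent univariate Gaussian densities."""
    prob = 1.0
    for j in range(len(x)):
        prob *= math.exp(-((x[j] - mu[j]) ** 2) / (2 * var[j])) \
                / math.sqrt(2 * math.pi * var[j])
    return prob

def is_anomaly(x, mu, var, epsilon=1e-3):
    return p(x, mu, var) < epsilon

# Fit on a small cluster of "normal" points, then test an obvious outlier.
normal = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05]]
mu, var = fit_gaussians(normal)
print(is_anomaly([1.0, 2.0], mu, var))    # False: close to the training cluster
print(is_anomaly([10.0, -5.0], mu, var))  # True: far from it
```

Note the independence assumption between features; this is exactly why the feature-selection slide pushes transformations that make each feature individually discriminative.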
48. Algorithm Evaluation
๏ Fit model on the training set
๏ On a cross validation/test example, predict whether it is anomalous
๏ Possible evaluation metrics:
๏ True positives, false positives, false negatives, true negatives
๏ Precision/Recall
๏ F1-score
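Those metrics are straightforward to compute from predictions. A sketch with made-up labels, where y = 1 means anomalous:

```python
# Confusion counts, precision, recall, and F1 from true vs predicted labels.
def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

m = evaluate([1, 0, 1, 0, 0, 1], [1, 0, 0, 0, 1, 1])
print(m["precision"], m["recall"], round(m["f1"], 3))
```

Plain accuracy is deliberately absent: with the skewed class ratios described earlier, predicting "not fraud" for everything scores very well on accuracy and catches nothing.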
50. Anomaly Detection*
Assume we have some labeled data of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
Training set: x(1), …, x(m) (assume normal examples/not anomalous)
Cross validation set: (x_cv(1), y_cv(1)), …, (x_cv(m_cv), y_cv(m_cv))
Test set: (x_test(1), y_test(1)), …, (x_test(m_test), y_test(m_test))
* Anomaly Detection - Andrew Ng - Coursera ML Course
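A sketch of how that labeled data might be divided; the 60/20/20 proportions are an assumption, not from the deck:

```python
# Train only on normal examples; spread the scarce labeled anomalies
# across the cross validation and test sets for tuning epsilon.
def split_for_anomaly_detection(normal, anomalous):
    n = len(normal)
    train = normal[: int(0.6 * n)]
    cv = normal[int(0.6 * n): int(0.8 * n)] + anomalous[: len(anomalous) // 2]
    test = normal[int(0.8 * n):] + anomalous[len(anomalous) // 2:]
    return train, cv, test

normals = [[i, i + 1] for i in range(10)]   # stand-ins for normal examples
anomalies = [[99, -99], [80, -80]]          # the few labeled anomalies
train, cv, test = split_for_anomaly_detection(normals, anomalies)
print(len(train), len(cv), len(test))  # 6 3 3
```

Keeping every labeled anomaly out of the training set is the whole trick: the density estimator learns only what "normal" looks like, and the labels are spent on choosing and evaluating the threshold.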
54. The lake again
[Architecture diagram: “Lake Simcoe going on Lake Superior”, a classic Lambda Architecture with various processing frameworks and near-realtime scoring/alerting*]
55. Fraud Capabilities and Technology
A. Batch ingest and analysis of transaction data from the database: traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B. Batch behavioural and portfolio heuristic fraud detection: model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage
C. Near-realtime anomaly and heuristic fraud detection: Kafka real-time ingest, Storm/Spark Streaming introduced for near-realtime interception of data, HBase for model/rule storage and lookup
D. Online model scoring: JPMML/Spark Streaming for realtime model scoring