Strata+Hadoop World London 2015


Presented by O’Reilly and Cloudera, Strata + Hadoop World is where big data, cutting-edge data science, and new business fundamentals intersect—and merge.

This is my guide to the biggest themes and presentations from Strata+Hadoop World London 2015.


1. ©latitude51north.com
Strata+Hadoop World London 2015
May 2015
2. {“about” : “me”}
Harvinder Atwal
• current: Head of Customer Insight and Marketing Optimisation, MoneySuperMarket.com
• previous: Insight Director, Tesco Clubcard, dunnhumby
• previous: Senior Manager, Customer Strategy and Insight, Lloyds Banking Group
Web: www.latitude51north.com
Twitter: @harvindersatwal
LinkedIn: harvindersatwal
Email: harvinder.s.atwal@gmail.com
3. What is Strata+Hadoop World?
Presented by O’Reilly and Cloudera, Strata + Hadoop World is where big data, cutting-edge data science, and new business fundamentals intersect and merge.
• The first day involved software tutorials and deep-dives, relating mostly to software in the Hadoop ecosystem, with many given by the software authors or contributors. This provides an excellent opportunity to take a closer look at a particular technology and ask in-depth questions of the people in the know.
• On the second day the conference proper starts and, despite there now being four Strata + Hadoop World conferences a year, offers a packed schedule of speakers from many of the industry’s leading organisations. Speakers this year included people from Barclays Bank, Google, CERN, Accenture, Pivotal, Databricks, Dato, MapR, comparethemarket.com and a great many more.
4. The quality of presentations was mixed. Here are my favourite keynotes.
1. Shazam - http://www.youtube.com/embed/mcTPvxo8SXY?autoplay=1
2. Ideas that Matter - We're always talking about "innovation", but, says Tim Harford, there are really two very different kinds of innovation. Using stories from sports, science, music, and military history, Tim will make you think differently about where good ideas come from and how they should be encouraged. http://www.youtube.com/embed/ohCavVVxX0M?autoplay=1
3. Is Privacy Becoming a Luxury Good? - Julia Angwin (ProPublica). We are being watched: by companies, by the government, by our neighbors. Technology has made powerful surveillance tools available to everyone. And now some of us are investing in counter-surveillance techniques and tactics. http://www.youtube.com/embed/fsWAZIfqPuU?autoplay=1
4. Overview of BT's internal multi-tenant Hadoop platform - Phil Radley (Chief Data Architect at British Telecom) gave an overview of BT's internal multi-tenant Hadoop platform. He explains their first production use case (master data management of BT UK business customer data) and gives a flavour of their use-case pipeline. https://youtu.be/YMoVShk5D
5. Julie Meyer (Ariadne Capital) - http://www.youtube.com/embed/a8u-bOoqYA4?autoplay=1
5. The most talked about technology at Strata + Hadoop World was…
• Naturally, ‘Hadoop’ itself comes out on top, with Spark a close second.
• Both Hadoop and Spark are frameworks for processing very large datasets on commodity computer clusters.
• Hadoop, Spark and many of the other most talked-about technologies (Hive, HDFS, Kylin, HBase, etc.) are Apache Software Foundation open-source projects.
• The Foundation is now responsible for most of the developments in Big Data tech.
• Scala is a relatively new programming language gaining rapid traction, especially for productionised machine learning applications. It runs on the Java Virtual Machine and is interoperable with Java libraries, yet it is far less verbose and easier to code than Java. It also fully supports functional programming, which is very fashionable.
[Figure: a plot produced by running a quick scoring algorithm on the Strata session outlines, showing the most talked about technologies from the conference. NB. SAS ranks high only because they were conference sponsors and presenters, not because they were well-talked-about tech!]
6. My key take-aways
• Apache Software Foundation open-source software has become the industry standard for Big Data processing and storage, and increasingly for querying and analysis. Some examples you may have heard of: Hadoop, Spark, Cassandra, HBase, Kafka.
• Spark is likely to supplant Hadoop as the Big Data processing platform of choice.
• Data lakes and how to deal with large quantities of streaming data are two hot topics in architecture.
• A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. Data lakes enable greater agility and a wider range of applications because the raw data is always available.
• Lambda architecture is the common solution to processing large quantities of streaming data.
• There were several tools and techniques we should explore further. Ivory looks very useful: it is an open-source package that can speed up model building and analysis by turning raw event data (e.g. customer enquiries) into a summarisation at a point in time by entity (e.g. enquiries by customer in the previous 12 months as at 31 May 2015).
7. Key theme 1 - Apache Foundation tech
8. The Apache ecosystem for Big Data is growing rapidly and it’s getting confusing!
9. So first a history lesson…
• In the early 2000s Google was finding it challenging to store and process the exploding volume of content on the Internet.
• Sanjay Ghemawat and Jeffrey Dean, senior researchers at Google, wrote a series of seminal papers that defined the way Google, and everyone else since, cracked the problem.
• To cope, Google invented a novel style of data processing known as MapReduce, a new way of saving data called the Google File System (GFS), and an original way to store data: BigTable, a distributed database.
“Google is living a few years in the future and sends the rest of us messages.” - Doug Cutting, creator of Hadoop
10. Wordcount is the canonical example for MapReduce
By processing in parallel you overcome the limits of a single machine, and you can scale by simply adding more nodes to the cluster.
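The wordcount idea can be sketched on a single machine in plain Python. This is only an illustration of the map/shuffle/reduce phases described above, not Hadoop's actual API: the mapper emits (word, 1) pairs, the shuffle groups pairs by key (the framework does this between phases), and the reducer sums each group.

```python
# Single-machine sketch of MapReduce wordcount (illustration only, not
# Hadoop's API): map emits (word, 1), shuffle groups by key, reduce sums.
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts["the"] is 3, counts["fox"] is 2
```

In a real cluster each phase runs on many nodes over partitions of the data, which is what lets the same three-step recipe scale.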
11. Google’s papers led to the development of open-source implementations
• Inspired by Google’s MapReduce paper, Doug Cutting developed Hadoop at Yahoo.
• For the first time it enabled companies to process huge quantities of data on cheap commodity hardware.
• GFS inspired HDFS, Hadoop’s Distributed File System.
• BigTable inspired HBase, a non-relational database able to host millions of columns and billions of rows.
• Many other applications have since been developed to build out the Hadoop ecosystem:
• Pig - a language for querying data using Hadoop
• Hive - a Hadoop layer for querying using a SQL-like language
• Mahout - a machine learning library
• However, despite these abstractions, programming in Hadoop is still not straightforward.
12. The Enterprise Hadoop market is now big business
Three vendors dominate the market. All have similar offerings. MapR’s solution is the best if you want to try out Hadoop/Spark for yourself: http://doc.mapr.com/display/MapR/Home
13. Spark is a complete game changer
• Spark is an engine for large-scale data processing that seems to be in the process of replacing the MapReduce paradigm. What seems to be driving Spark’s adoption at the moment is its raw speed: it claims speed increases over MapReduce of up to 100 times in memory and 10 times on disk.
• Like MapReduce, it works with the filesystem to distribute your data across the cluster and process it in parallel. However, Spark tries to keep things in RAM (fast), whereas MapReduce keeps shuffling things in and out of disk (slow).
• Spark is also much more powerful and expressive in how you give it instructions to crunch data, abstracting away a lot of complexity and allowing more interactive analysis of data.
• Spark 1.4 will feature first-class support for and integration with R. With version 1.4 the SparkR project will be officially integrated, which means R will join Java, Scala and Python as a fully supported language.
• http://cdn.oreillystatic.com/en/assets/1/event/126/Apache%20Spark_%20What_s%20new_%20what_s%20coming%20Presentation.pdf
14. Spark has its own ecosystem extending usability
• Spark is capable of working with data stored inside a Hadoop cluster, can use data stored in Amazon’s S3, and can work with data stored locally, which means it’s really easy to experiment with.
• You can perform interactive analysis of large datasets without sampling, and have the same architecture for insight and production.
• Spark can be used for analysing live streaming data (web logs, sensors, social media, etc.) using the same API as batch data.
• The Spark ecosystem features MLlib, a scalable machine learning library.
• GraphX is a component for social network analysis, fraud detection, recommendations and other graph analysis.
• Spark is also useful for less glamorous jobs like ETL.
15. I’ve been trying out Spark via the Python API and can confirm it’s fast
Example task: analysing log data to count the number of unique hosts per day.
• MapR’s Sandbox is the best way to get started with Spark: https://www.mapr.com/blog/getting-started-spark-mapr-sandbox#.VZOxjPlViPw
• Spark can also be used with Vertica: http://www.vertica.com/wp-content/uploads/2013/12/Vertica_MapR_Solution-Brief_Feb-2014.pdf
• Although Spark abstracts away some of the complexity of Hadoop, the limited number of Actions and Transformations available still requires a shift in mindset and more steps compared to SQL and SAS.
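For reference, the unique-hosts-per-day computation can be sketched in plain Python. The log format here (date, time, host, then the request) is an assumption for illustration; in Spark the same steps map naturally onto a chain of transformations (parse each line, de-duplicate (day, host) pairs, count per day) over an RDD of lines.

```python
# Plain-Python sketch of the analysis described above: count distinct
# hosts per day. The log line format is assumed for illustration.
from collections import defaultdict

def unique_hosts_per_day(log_lines):
    hosts_by_day = defaultdict(set)        # day -> set of distinct hosts
    for line in log_lines:
        day, _time, host = line.split()[:3]
        hosts_by_day[day].add(host)        # the set de-duplicates hosts
    return {day: len(hosts) for day, hosts in sorted(hosts_by_day.items())}

logs = [
    "2015-05-05 10:01:22 host-a GET /quotes",
    "2015-05-05 10:05:09 host-b GET /quotes",
    "2015-05-05 11:14:41 host-a GET /apply",
    "2015-05-06 09:00:02 host-c GET /quotes",
]
daily = unique_hosts_per_day(logs)
# daily == {"2015-05-05": 2, "2015-05-06": 1}
```

The point of Spark is that the same logic, expressed as RDD transformations, runs unchanged over terabytes of logs across a cluster.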
16. HBase and Cassandra were the most talked about databases for big data
• HBase is an open-source, non-relational, distributed database modelled after Google’s BigTable. Apache Impala is commonly used to query HBase data.
• Apache Cassandra was initially developed at Facebook to power its Inbox Search feature, by Avinash Lakshman (one of the authors of Amazon’s Dynamo) and Prashant Malik. Cassandra is essentially a hybrid between a key-value and a column-oriented (or tabular) database.
• Apache Accumulo is a key/value store based on Google BigTable.
• Hypertable is another open-source database inspired by you-know-who. Because Hypertable keeps data physically sorted by a primary key, it lends itself to applications that require fast access to ranges of data (e.g. analytics, sorted URL lists, messaging applications, etc.).
• The big news at Strata, though, was that Google has made BigTable itself publicly available through Google Cloud. Because they started it all, they claim it’s faster and better than everything else.
17. Apache Drill looks exciting; use SQL to query multiple NoSQL data sources
• Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and cloud storage.
• Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores; for example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
• Drill even connects to Tableau for visualisation of output.
• Good example of Drill usage here: https://www.mapr.com/blog/apache-drill-its-drilliant-query-json-files-tableau-desktop
18. Key theme 2 - Enterprise Data Hubs, Data Lakes (or Data Swamps)
19. There’s a trend away from fixed to flexible data structures
• Traditionally, data and code have been tightly coupled to a schema to support specific applications.
• This causes serious problems when you realise you have a new application for your data but your original schema won’t support it.
• The ADM and MDM in MSM are very good examples of this problem. Both are designed for specific use cases (insight and campaigns).
• If new data is required by the use cases then expensive development is needed.
• A lot of campaign data useful for insight is not available in the ADM, and vice-versa.
Schema-on-write is the traditional approach to processing and storing data: data goes through an ETL process to make it uniform and fit the predefined schema, or it’s dropped.
20. Flexible data structures try to overcome traditional limitations
• The schema-on-read approach takes the same raw data but lands it (relatively unprocessed) all in the same place.
• Then, instead of building a series of applications on top of custom schemas, you make the data dynamically available to various services through code.
• This is a very different way to use data, but it provides much more agility.
• As data users we can get our heads round this, but it will appear completely upside down (possibly insane) to traditional data architects.
Schema-on-read keeps the data in raw format; a schema is only applied as you decide how to use the data, through code.
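A toy illustration of the schema-on-read idea: events land in the "lake" as raw JSON strings, and each consumer applies its own schema in code only at read time. The event fields and names below are invented for illustration.

```python
# Schema-on-read sketch: raw events are stored untouched; a consumer
# projects its own schema onto them only when reading. Field names are
# hypothetical.
import json

raw_landing_zone = [
    '{"type": "enquiry", "customer": "c1", "product": "motor", "ts": "2015-05-01"}',
    '{"type": "click",   "customer": "c2", "page": "/home",    "ts": "2015-05-01"}',
    '{"type": "enquiry", "customer": "c2", "product": "home",  "ts": "2015-05-02"}',
]

def read_enquiries(zone):
    """One consumer's view: only enquiry events, projected to two fields."""
    for record in map(json.loads, zone):
        if record["type"] == "enquiry":
            yield {"customer": record["customer"], "product": record["product"]}

enquiries = list(read_enquiries(raw_landing_zone))
```

A different consumer (say, a clickstream analysis) would read the same raw records with a different projection, which is exactly the agility the slide describes: no upfront schema decides what questions you are allowed to ask later.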
21. Schema-on-read is one of the key differences between a Data Warehouse and a Data Lake
• A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
• There are advantages and disadvantages to data lakes.
• Pros:
• Agility - data is always available for any use case, moving us from a data-centric to a use-centric view.
• Value - improves data discovery and advanced analytics capability.
• Cons:
• Onus on user - puts pressure on the user to understand the raw data and write more code.
• Skills - exploiting raw data requires an advanced skillset.
22. Cloudera’s compromise is to hold data in different layers - an approach adopted by CTM, Goldman Sachs and others
• Raw data is always available and readable by Spark, Drill, etc.
• An enriched Discovery view is available to analysts and data scientists using Spark, Impala, Drill, etc.
• A Shared layer is available to the business, with multiple sources joined together.
• An Optimised layer is used to operationalise the use cases; it is organised by data consumer rather than source, and optimised for performance.
• A Speed layer, following lambda architecture principles, supports real-time analysis.
• There can be more, or fewer, layers.
[Diagram: Data Sources → Raw Layer → Discovery Layer → Shared Layer → Optimised Layer → Data Consumers, with a Speed Layer alongside.]
23. Further reading
• Information architecture for Apache Hadoop - Mark Samson (Cloudera)
http://cdn.oreillystatic.com/en/assets/1/event/126/Information%20architecture%20for%20Apache%20Hadoop%20Presentation.pptx
• It ain’t what you do to data, it’s what you do with it (Silicon Valley Data Science)
http://cdn.oreillystatic.com/en/assets/1/event/126/It%20ain%E2%80%99t%20what%20you%20do%20to%20data,%20it%E2%80%99s%20what%20you%20do%20with%20it%20Presentation.pptx
http://svds.com/art-abstraction
• Systems that enable agility
https://speakerdeck.com/ept/systems-that-enable-data-agility
24. Lambda architecture is the standard for processing streaming Big Data
• Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
• Nathan Marz designed this generic architecture, which addresses common requirements for big data, based on his experience working on distributed data processing systems at Twitter.
• The Batch layer manages the master dataset and computes views infrequently.
• The Speed layer is for real-time querying; its data is dropped as soon as it has been processed by the batch layer.
• The Serving layer brings the Batch and Speed layers together so they can be queried.
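The interplay of the three layers can be sketched with an in-memory toy that counts events per key. This is only an illustration of how the layers fit together under the assumptions above; a real system would use something like Hadoop or Spark for the batch layer and a stream processor for the speed layer.

```python
# Toy lambda-architecture sketch: batch layer recomputes a view from the
# full master dataset; speed layer counts only events that arrived since
# the last batch run; serving layer merges the two at query time.
from collections import Counter

master_dataset = ["a", "b", "a"]   # immutable, append-only event history

def batch_view(dataset):
    """Batch layer: recompute the view from the whole master dataset."""
    return Counter(dataset)

batch = batch_view(master_dataset)  # infrequent, expensive recomputation
speed = Counter()                   # speed layer: recent events only

def ingest(event):
    """New events append to the master dataset and update the speed layer."""
    master_dataset.append(event)
    speed[event] += 1

def query(key):
    """Serving layer: merge batch and speed views for a complete answer."""
    return batch[key] + speed[key]

ingest("a")   # arrives after the last batch run
ingest("c")
# query("a") == 3: 2 from the batch view plus 1 from the speed layer
```

When the batch layer next recomputes, the speed layer's counts for those events are discarded, which is the "data is dropped once the batch layer has processed it" behaviour described above.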
25. Useful tools
26. Ivory looks like an incredibly useful package for speeding up the modelling pipeline
• Ivory is a package from Ambiata dubbed a datastore for features.
• Very commonly in analysis or the modelling process we need to know the states and/or data summarisations for entities at historic points in time, e.g. number of enquiries in the month before PSD, whether the enquirer had ever previously bought Motor, number of visits in the previous month, three months, 12 months, etc.
• Ivory allows you to easily create these features from a table of events.
• https://speakerdeck.com/ambiata/improving-feature-engineering-in-the-lab-and-production-with-ivory
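The kind of point-in-time feature Ivory builds can be sketched in plain Python: turning a raw event table into "enquiries per customer in the 12 months up to a given date". This is only an illustration of the idea, not Ivory's actual API, and the event rows are invented.

```python
# Sketch of a point-in-time feature: count events per entity within a
# trailing 365-day window ending at an "as at" date. Illustration only;
# Ivory computes such features at scale, not via this code.
from datetime import date, timedelta
from collections import Counter

events = [  # (customer, event_date) rows of a hypothetical enquiry table
    ("c1", date(2014, 7, 1)),
    ("c1", date(2015, 3, 15)),
    ("c2", date(2013, 12, 31)),   # too old: falls outside the window
    ("c2", date(2015, 5, 20)),
]

def enquiries_last_12_months(events, as_at):
    """Count events per customer in the 365 days up to and including as_at."""
    window_start = as_at - timedelta(days=365)
    feature = Counter()
    for customer, when in events:
        if window_start < when <= as_at:
            feature[customer] += 1
    return dict(feature)

features = enquiries_last_12_months(events, as_at=date(2015, 5, 31))
# features == {"c1": 2, "c2": 1}
```

The useful property is that the same event table can answer the question for any historic "as at" date, which is exactly what model-building against historic snapshots requires.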
27. Machine Learning
28. The future of machine intelligence and why it matters - Shivon Zilis (Bloomberg Beta)
• How machine learning can make your life easier.
• http://cdn.oreillystatic.com/en/assets/1/event/126/The%20future%20of%20machine%20intelligence%20and%20why%20it%20matters%20Presentation%201.pdf
• Some very good resources/tools were mentioned in this presentation:
• Meeting preparation: http://quid.com/
• Scheduling: https://claralabs.com/ and https://x.ai/
• Competitive analysis: http://mattermark.com/ and https://app.datafox.co/
• Conference calls: http://www.gridspace.com/
• Talent: https://www.textio.com/
• Emails: http://www.inboxvudu.com/
29. Forecasting space-time events - predictive policing, Minority Report style
• This session used the speaker’s experience building a crime forecasting package to outline tools and techniques useful in modelling space-time event data. https://www.hunchlab.com/
• Concepts: while many data scientists work with data that includes geographic information, this data is often used in rather rudimentary ways or limited to vector datasets such as the point locations of stores or users. The session introduced the strengths and weaknesses of raster-based geographic analysis, and discussed some challenges faced when modelling data at a fine geographic and temporal resolution; for example, how can uncertainty around the time of occurrence of events be represented?
• GeoTrellis (http://geotrellis.io/): the case study leverages the open-source GeoTrellis framework to conduct geographic processing. GeoTrellis is currently an incubating project within the Eclipse Foundation’s LocationTech working group. It provides fast and scalable geographic processing, with an emphasis on raster-based analysis and routing through transportation networks. Already written in Scala, GeoTrellis is currently being extended to integrate with Apache Spark.
• Modelling: the modelling pipeline within the case study consists of several loosely coupled components. In addition to GeoTrellis, the project uses R for machine learning and the Amazon Simple Workflow service for pipeline orchestration. Several statistical techniques were examined throughout the development of the project; the final approach was a stacked model incorporating a gradient boosting machine (GBM) to model the presence of events and a generalized additive model (GAM) to transform these predictions into expected counts.
• The session concluded by outlining some approaches to evaluating predictive accuracy for these types of datasets.
30. Data Visualisation and prototyping
31. PDF is the second biggest religion in the UK, and other amusing insights
• Visualizing the world’s largest democratic exercise: the election results page for the 2014 Indian general elections was hosted on CNN-IBN and bing.com, with a focus on real-time analysis of results for users and TV anchors. With over 540 million voters and 100 million viewers, the volume and complexity of the data both provide a design challenge. This talk focused on the techniques behind the design. https://gramener.com/shows/slides/strata#/
• Situational awareness: this is not the data you’re looking for. http://cdn.oreillystatic.com/en/assets/1/event/126/Situational%20awareness_%20This%20is%20not%20the%20data%20you_re%20looking%20for%20Presentation.pdf
32. Accenture gave a great talk on the case for building an in-house data insights lab
• Accenture talked about the challenges of traditional approaches to getting buy-in for data science solutions within companies.
• Their solution is to make concepts practical for clients by using open-source technology and open data, creating visualisations, and mocking up prototypes in a few weeks.
• This requires a multi-disciplinary team of data scientists, engineers and designers who can use cutting-edge technologies to bring concepts to life - ‘The Technology Lab’.
• Their case study was a US bank client. Accenture had difficulty convincing the senior execs of the value of a risk dashboard, so the Technology Lab decided to build a prototype using open-source software and open mortgage loan/default data. Once the execs got their hands on the prototype they soon gave the go-ahead for a version using internal data.
• McKinsey also has a similar Digital Labs concept: http://www.mckinsey.com/client_service/mckinsey_digital/expertise/digital_labs
33. Other presentations
34. Advanced Machine Learning
• Deploying machine learning in production: what could possibly go wrong? - Alice Zheng (Dato)
Building and deploying predictive applications requires knowing how to evaluate, test, and track the performance of machine learning models over time. Using available off-the-shelf tools, this talk engaged potential application builders on topics such as common evaluation metrics, A/B testing setup, tracking model performance, tracking usage via real-time feedback, and updating models.
http://cdn.oreillystatic.com/en/assets/1/event/126/Deploying%20machine%20learning%20in%20production%20Presentation%201.pdf
• Scalable machine learning
While the data management side of Big Data has seen tremendous progress in the past few years, bringing technologies like Hadoop or Spark together with advanced machine learning and data analysis methods is still a major challenge. This talk discussed recent advances, approaches, and patterns used to build truly scalable machine learning solutions.
http://cdn.oreillystatic.com/en/assets/1/event/126/Scalable%20machine%20learning%20Presentation.pdf
• Deep learning made doubly easy with reusable deep features
http://blog.dato.com/deep-learning-blog-post
35. Data Science
• The curiosity advantage: the most important skill for data science
Curiosity is one of the most valued skills for people working in data science, but how can we train it? Einstein said that "curiosity is an important trait of a genius". The session explored how we can develop our curiosity through three exercises: finding pleasure in uncertainty, questioning the question we’re asking, and finding a beginner’s mind, all with direct application to data science.
http://cdn.oreillystatic.com/en/assets/1/event/126/The%20curiosity%20advantage_%20the%20most%20important%20skill%20for%20data%20science%20Presentation.pdf
• Measuring the benefit effect for customers with Bayesian predictive modelling
http://cdn.oreillystatic.com/en/assets/1/event/126/Measuring%20the%20benefit%20effect%20for%20customers%20with%20Bayesian%20predictive%20modeling%20Presentation.pdf
36. Other
• Using data for evil
http://www.slideshare.net/DuncanRoss1/using-data-for-evil-2
• Apache Flink
http://www.slideshare.net/stephanewen1/apache-flink-overview-and-use-cases-at-prehadoop-summit-meetups?next_slideshow=1
• Moves the Needle brings Lean Startup principles, tools, tactics and strategy to the enterprise
http://www.movestheneedle.com/
37. Resources
• Speakers and slides: http://strataconf.com/big-data-conference-uk-2015/public/schedule/speakers
• O’Reilly data blog: https://beta.oreilly.com/topics/data
• Spark training slides: http://training.databricks.com/workshop/sparkcamp.pdf
