Learn more about the tools, techniques and technologies for working productively with data at any scale. This presentation introduces the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Jon Einkauf, Senior Product Manager, Elastic MapReduce, AWS
Alan Priestley, Marketing Manager, Intel and Bob Harris, CTO, Channel 4
8. Generated data vs. data available for analysis
[Chart. Axis: data volume. Series: generated data; available for analysis.]
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
9. Elastic and highly scalable
+ No upfront capital expense
+ Only pay for what you use
+ Available on-demand
= Remove constraints
32. How does it work?
[Diagram labels: EMR, EMR Cluster, S3]
1. Put the data into S3 (or HDFS).
2. Launch your cluster. Choose:
• Hadoop distribution
• How many nodes
• Node type (hi-CPU, hi-memory, etc.)
• Hadoop apps (Hive, Pig, HBase)
3. Get the results.
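As a rough sketch of the choices made at launch, the options can be written down as a small record before any cluster is started. The class name ClusterRequest, its fields, and the example bucket and values are hypothetical and are not part of the EMR API; the sketch only checks the choices and does not call AWS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterRequest:
    """Hypothetical record of the launch choices listed on the slide."""
    input_uri: str            # where the data was put (S3 or HDFS)
    output_uri: str           # where the results will be read from
    hadoop_distribution: str
    node_count: int
    node_type: str            # e.g. "hi-CPU" or "hi-memory"
    apps: List[str] = field(default_factory=list)  # e.g. Hive, Pig, HBase

    def validate(self) -> None:
        """Raise ValueError if the request cannot describe a cluster."""
        if self.node_count < 1:
            raise ValueError("a cluster needs at least one node")
        for uri in (self.input_uri, self.output_uri):
            if not uri.startswith(("s3://", "hdfs://")):
                raise ValueError(f"expected an S3 or HDFS location, got {uri!r}")


# Placeholder values, chosen only for illustration.
request = ClusterRequest(
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
    hadoop_distribution="example-distribution",
    node_count=10,
    node_type="hi-memory",
    apps=["Hive", "Pig"],
)
request.validate()
```

Turning such a record into a real launch call depends on the SDK or console in use and is deliberately left out here.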
40. Give it a try.
Cost to run a 100-node EMR cluster:
£4.90 / hour
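The arithmetic behind that figure can be sketched as follows. The only input taken from the slide is £4.90 per hour for 100 nodes; the function names are illustrative, and the assumption that cost scales linearly with nodes and hours is a simplification, since real pricing depends on instance type and purchasing option.

```python
from decimal import Decimal

# Figure quoted on the slide: a 100-node cluster costs £4.90 per hour.
CLUSTER_COST_PER_HOUR = Decimal("4.90")
CLUSTER_NODES = 100


def cost_per_node_hour() -> Decimal:
    """Implied price of one node for one hour, in pounds."""
    return CLUSTER_COST_PER_HOUR / CLUSTER_NODES


def job_cost(nodes: int, hours: int) -> Decimal:
    """Estimated cost in pounds, assuming linear pricing per node-hour."""
    return cost_per_node_hour() * nodes * hours


print("Per node-hour: £", cost_per_node_hour())
print("100 nodes for 3 hours: £", job_cost(100, 3))
```

Under the same linear assumption, halving the node count for a job that then takes twice as long costs the same.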
41. AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage temporary compute resources
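The ideas of mapped dependencies and retry logic can be illustrated with a small sketch. This is not the AWS Data Pipeline API: run_pipeline, the task names, and the retry limit are all hypothetical, and the example uses only the Python standard library to run tasks in dependency order, retrying a task that fails.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on, max_attempts=3):
    """Run each task after its upstream tasks, retrying on failure.

    tasks:      task name -> zero-argument callable
    depends_on: task name -> set of upstream task names
    Returns the execution order and the number of attempts per task.
    """
    graph = {name: depends_on.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    attempts = {}
    for name in order:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                tasks[name]()
                break  # task succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
    return order, attempts


calls = {"clean": 0}


def flaky_clean():
    """Fails on the first call to simulate a transient error."""
    calls["clean"] += 1
    if calls["clean"] == 1:
        raise RuntimeError("transient failure")


tasks = {
    "copy_logs": lambda: None,
    "clean": flaky_clean,
    "report": lambda: None,
}
depends_on = {"clean": {"copy_logs"}, "report": {"clean"}}

order, attempts = run_pipeline(tasks, depends_on)
print(order)
print(attempts)
```

A managed service adds scheduling and provisions the temporary compute resources itself; the sketch covers only ordering and retries.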
48. C4 in the Cloud
• 2008 – Started investigations into Cloud Computing
• 2008 – Launched our first applications on AWS
• 2009 – Entered into an Enterprise Agreement with Amazon for AWS
Rapid growth of AWS-based offerings during 2009/2010
• 2011 – AWS established as the default platform of choice for new websites
• 2012 – Adopted cloud-based analytics
• 2013 – Investigating cloud-based back-up and archiving
52. Business Intelligence at C4
• Well-established Business Intelligence capability
• Based on industry-standard proprietary products
• Real-time data warehousing
• Comprehensive business reporting
• Excellent internal skills
• Good external skills availability
53. Big Data at C4
2011
• Embarked on Big Data initiative in 2011
• Ran in-house and cloud-based PoCs
• Selected AWS Elastic MapReduce
2012
• Ran EMR in parallel with conventional BI stack
• Hive deployed to Data Analysts in 2012
• EMR workflows deployed to production in 2012
2013
• EMR confirmed as primary Big Data platform
• EMR usage growing, focus on automation
• Experimenting with R and Mahout
54. Big Data at C4 – Elastic MapReduce
• AWS EMR established as our Big Data platform of choice
• Friendly front-end developed to allow Data Analysts to start/stop clusters and submit/track queries.
• Production workflows written predominantly in Python and Pig
• Fully integrated with our conventional BI stack making EMR outputs available for reporting
• Experimenting with ADP (AWS Data Pipeline)
• Next steps – MapR and HBase
57. Personalising the viewer experience
Most popular dramas
Drama collections
US drama
Single view of the viewer: recognising them across devices and serving relevant content
Big Data – Improving Viewer Experience
58. Myths or Truths? – It’s all about Perspective!
• Nothing that can’t be done with an RDBMS
• It’s a completely different approach
• It’s really difficult
• It’s immature and lacks good tools
• It’s totally incompatible with your current BI platform and tools
• It’s difficult to find skilled and experienced staff
Image by Tayrawr Fortune
Elastic MapReduce has provided a cost-effective approach to establishing our Big Data platform
61. Analysis of Data Can Transform Society
• Create new business models and improve organizational processes.
• Enhance scientific understanding, drive innovation, and accelerate medical cures.
• Increase public safety and improve energy efficiency with smart grids.
62. Democratizing Analytics gets Value out of Big Data
• Unlock Value in Silicon
• Support Open Platforms
• Deliver Software Value
63. Intel at the Intersection of Big Data
• HPC: Enabling exascale computing on massive data sets
• Cloud: Helping enterprises build open interoperable clouds
• Open Source: Contributing code and fostering ecosystem
64. Intel at the Heart of the Cloud
Server
Storage
Network
65. Scale-Out Platform Optimizations for Big Data
Cost-effective performance
• Intel® Advanced Vector Extensions Technology
• Intel® Turbo Boost Technology 2.0
• Intel® Advanced Encryption Standard New Instructions Technology
66. Intel® Advanced Vector Extensions Technology
• Newest in a long line of processor instruction innovations
• Increases floating point operations per clock, up to 2X performance [1]
[1] Performance comparison using Linpack benchmark. See backup for configuration details.
For more legal information on performance forecasts go to http://www.intel.com/performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
67. Intel® Turbo Boost Technology 2.0
More Performance
Higher turbo speeds maximize performance for single and multi-threaded applications
68. Intel® Advanced Encryption Standard New Instructions
• Processor assistance for performing AES encryption (7 new instructions)
• Makes enabled encryption software faster and stronger
69. Power of the Platform built by Intel
Richer user experiences
[Figure: TeraSort for 1TB sort. Baseline: previous Intel® Xeon® processor. Times shown: 4 hrs, 10 min. Platform changes shown, in order: Intel® Xeon® Processor E5 2600, Solid-State Drive, 10G Ethernet, Intel® Apache Hadoop. Reductions shown, in order: 50%, 80%, 50%, 40%.]