
Jumpstarting big data projects / Architectural Considerations of HDInsight Applications


Hadoop is everywhere, including in fashion e-commerce. We recently built a solution with a leader in fashion e-commerce to drive their visitor experience and provide a personalised view based on recommendations.

In this presentation, we walk through the architectural considerations and decisions made in building the solution on Microsoft Azure, and especially in HDInsight (Hadoop offered as a platform as a service on Microsoft Azure), highlighting the following stages:
• Data Storage
• Data Ingestion
• Data Processing
• Orchestration
• Validation & Troubleshooting

As we go through each stage, we present the technology options (from both the Microsoft Azure and open-source stacks) and our reasoning for selecting the optimal one in our case.
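Operationally, the processing and orchestration stages come down to the "deploy compute, run workload, monitor" loop shown on slides 26-28 of the transcript below: the cluster is created on demand, jobs are submitted against data held in blob storage, and the cluster is deleted once the outputs are extracted. The following is a minimal sketch of that loop with the Azure PowerShell cmdlets of this era; the cluster name, credential, location and Pig script path are placeholders rather than the values used in the project, and $clusterConfig is assumed to have been built as on slide 28.

# Sketch: provision an HDInsight cluster, run one (hypothetical) Pig job, collect logs, release the compute.
$clusterName = "recs" + [DateTime]::Now.ToString("yyyyMMddHHmm")    # placeholder naming scheme
$credential  = Get-Credential                                       # admin credential for the new cluster

New-AzureHDInsightCluster -Name $clusterName -Location "North Europe" `
    -Config $clusterConfig -Credential $credential

# Submit a Pig script stored in the cluster's default container and wait for completion.
$pigJob = New-AzureHDInsightPigJobDefinition -File "wasb:///scripts/prepare-features.pig"   # hypothetical script path
$job    = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $pigJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError   # keep stderr for troubleshooting
Remove-AzureHDInsightCluster -Name $clusterName                                      # release the compute

Deleting the cluster between runs keeps compute cost proportional to the daily workload, since the raw and processed data live in the storage account rather than in cluster-local HDFS (compare slides 32-35).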


Jumpstarting big data projects / Architectural Considerations of HDInsight Applications

  1. 1. Jumpstarting Big Data Projects
  2. 2. Architectural Considerations in HDInsight
  3. 3. Up to 75 control units in one vehicle; about 1,000 possible individual items of optional equipment; 1 GB of car software, 15 GB of data on board (incl. navigation); 2,000 user functions implemented; 12,000 error types stored on board for diagnosis; up to 60,000 car diagnoses worldwide every day
  4. 4. </meldungText><antwort>False</antwort><wert>na</wert></meldung><steuergeraet sgbdVariante="SMG_60"><steuergeraeteFunktion zeitstempel="2013-04-30T09:00:37.9926171-04:00" endDate="2013-04-30T09:00:38.1158609-04:00" jobName="STATUS_FAHRZEUGTESTER"><datensatz satzNr="1"><result name="JOB_STATUS">OKAY</result><result name="_TEL_ANTWORT">80 F1 18 70 70 02 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 82 6B 00 6D 6B 39 CD 14 00 14 00 00 0E 00 15 00 0A 00 19 00 0C 00 12 00 15 85 57 71 88 81 C0 7D 73 C2 08 01 05 02 F7 00 FF FF 01 73 00 00 02 A8 00 C2 00 01 E0 00 00 00 00 00 00 3D 01 00 00 00 01 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 FD 01 E1 02 05 01 F8 03 4F FF AD 04</result><result name="_TEL_AUFTRAG">83 18 F1 30 02 01</result><result name="STAT_KL15_ROH">0</result><result name="STAT_KLR_EIN_ROH">0</result><result name="STAT_WAKE_UP_ROH">1</result><result name="STAT_ISTGANG_TEXT">Neutral</result> <sgFunktion zeitstempel=“2013-04-30T10:33:37.0834084+02:00" endDate="2013-04-30T10:33:37.9310504+02:00" jobName="_FLM_LESEN_BOSCH"><datensatz satzNr="1"><result name="FLM_DATEN_1">00 00 00 03 02 08 C6 56 46 4C 4D 39 00 16 4B B2 00 00 00 32 00 00 06 99 00 00 00 65 00 00 18 6E 00 00 00 73 00 00 00 20 00 00 00 73 00 00 00 00 00 00 10 69 00 00 0F 53 00 00 00 2C 00 00 00 0A 00 00 79 6D 00 00 B7 34 00 00 D3 9E 4A 4C 41 52 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2C 00 00 00 00 00 00 1A 5C 00 15 4B CA 00 00 44 08 00 00 2D 39 00 00 1E 45 00 00 26 89 00 00 1E EB 00 00 0C 65 00 00 04 47 00 00 00 00 00 00 00 00 00 00 00 04 00 00 00 27 00 00 01 1E 00 00 02 AB 00 00 07 71 00 00 13 D7 00 00 36 48 00 15 91 AD 00 00 3F 97 00 00 19 C1 00 00 07 F9 00 00 02 D4 00 00 00 BD 00 00 00 20 00 16 1C 42 00 00 18 B1 00 00 09 40 00 00 08 9F 00 00 04 3A 00 00 01 3E 00 01 8C D7 00 00 61 A3 00 00 37 9D 00 00 1E 78 00 00 14 96 00 00 0A 71 00 00 05 49 00 00 02 B1 00 00 00 A7 00 00 00 1D 00 00 00 09 00 00 00 05 00 00 00 00 00 00 00 00 00 00 23 BB 00 00 2F 84 00 00 14 EF 00 00 09 40 00 00 04 71 00 00 03 34 00 00 02 12 00 00 01 AC 00 00 01 59 00 00 0B C4 00 00 00 06 00 00 00 38 00 00 00 19 00 00 00 01 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 52 4F 54 48 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 00 56 30 00 00 00 03 00 11 00 01 01 06 00 00 00 00 00 00 00 00 00 01 00 00 00 0E 00 05 00 1A 00 12 00 00 00 26 00 00 00 00 00 0B 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 44 44 00 43 00 16 00 08 00 0D 00 04 00 02 00 00 00 02 00 11 00 20 00 1A 00 0A 00 15 00 0F 00 1B 00 13 00 08 00 08 00 00 00 00 00 07 00 0E 00 08 00 04 00 02 00 01 00 00 00 6D 00 03 00 02 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0A 00 21 00 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 18 05 1F 00 00 00 00 00 00 00 00 00 1F 00 03 00 02 00 00 00 00 00 00 00 20 00 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 62 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2E 00 00 1B 00 19 00 18 00 0D 00 00 00 00 00 00 00 01 00 01 00 02 00 00 06 00 01 E6 00 00 12 00 03 00 02 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00 02 01 BA 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 24 00</result><result name="FLM_DATEN_2">08 00 00 00 00 00 00 00 00 00 00 0C 
00 80 1B 00 45 10 00 A6 0D 00 51 16 00 59 44 00 00 EB 00 00 CA 00 00 49 00 00 17 00 10 00 0C 00 05 00 04 00 06 00 02 00 01 00 00 00 00 12 00 00 3A… Here!
  5. 5. HDInsight http://aka.ms/HDIpowershell Hadoop on Azure https://github.com/lararubbelke/Azure-DDP/
  6. 6. Databases Storage Account Others
  7. 7. Hadoop does not do well with lots of small files: http://aka.ms/HDI_smallfiles (a merge-and-upload sketch follows the transcript)
  8. 8. Clean & prepare data Customer polarisation & segmentation
  9. 9. - SQL-like abstraction of MapReduce - Compatible with other RDBMSs - Simple aggregations - DBAs - Complex joins - Unstructured data - UDFs are complex (a query-submission sketch follows the transcript)
  10. 10. - Slicing & dicing (or complex queries) - Streaming objects - UDF - Software developers - Integration with analytical tools/RDBMSs
  11. 11. Feature vector per customer Scoring and prioritisation
  12. 12. Mahout Azure Machine Learning
  13. 13. Distributed similarity measure Version number of Mahout
  14. 14. Input Output
  15. 15. Limit of #similar items per item; max #preferences per user/item (a recommender-job sketch follows the transcript)
  16. 16. 1. 2. 3. 4.
  17. 17. user/<remote_user> user/hdp/ user/<remote_user> wasb://<container>@<storage_account>.blob.core.windows.net/user/hdp/testdata/KDDTest
  18. 18. hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p wasb:///user/hdp/testdata/KDDTrain+.arff -f wasb:///user/hdp/testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L Describing the columns (a build-and-test sketch follows the transcript)
  19. 19. Leaf size
  20. 20. Data Data Set
  21. 21. Selection #Trees Partial Output
  22. 22. Test Data Data Set / Descriptor File Forest Model
  23. 23. Analysing the classification results Use MapReduce Jobs Output
  24. 24. Confusion matrix: actual normal: 9,458 predicted normal, 253 predicted anomaly; actual anomaly: 4,508 predicted normal, 8,325 predicted anomaly
  25. 25. accuracy = #correctly classified instances / #classified instances (worked out after the transcript)
  26. 26. Scoring and prioritization Read & process data for the day → Extract into outputs
  27. 27. Deploy compute Run workload Monitoring
  28. 28. $coreConfig = @{
    "io.compression.codec" = "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec";
    "io.sort.mb" = "1024";
}
$mapredConfig = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightMapReduceConfiguration'
$mapredConfig.Configuration = @{ "mapred.tasktracker.map.tasks.maximum" = "2"; }
$clusterConfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $numberNodes |
    Set-AzureHDInsightDefaultStorage -StorageAccountName $fqStorageAccountName -StorageAccountKey $storageAccountKey `
        -StorageContainerName ($storageContainer.Name)
$continueCheck = Read-Host "Attach additional storage accounts? (yes to continue)"
if ($continueCheck -eq "yes") {
    foreach ($asa in 1..5) {
        $newStorageAccountName = ($clusterPrefix + [DateTime]::Now.ToString("yyyyMMddHHmmss") + "a" + $asa)
        New-AzureStorageAccount -StorageAccountName $newStorageAccountName -Location "North Europe"
        $clusterConfig = $clusterConfig | Add-AzureHDInsightStorage `
            -StorageAccountName ($newStorageAccountName + ".blob.core.windows.net") `
            -StorageAccountKey (Get-AzureStorageKey $newStorageAccountName).Primary
    }
}
$clusterConfig = $clusterConfig | Add-AzureHDInsightConfigValues -Core $coreConfig -MapReduce $mapredConfig
# "At this point we are able to create a hdinsight cluster with a customised configuration"
http://aka.ms/HDIconfiguration
  29. 29. Example: Oozie job configuration
nameNode=wasb://container_name@storage_name.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs
outputDir=ooziejobs-out
oozie.use.system.libpath=true
  30. 30. 12/29 6/29 Customer ID URL Item Time Customer ID Item
  31. 31. IIS Logs IIS Logs
  32. 32. Raw Data Storage Processing Processed Data Storage Validation & Troubleshooting Raw Data Storage Processing Processed Data Storage
  33. 33. Compute Mahout / Pig Calculations I/O HDFS Blob
  34. 34. Raw Data Storage Processing Processed Data Storage
  35. 35. Raw Data Storage Processing Processed Data Storage
  36. 36. AuthorizationError, Availability, AverageE2ELatency, AverageServerLatency, ClientTimeoutError, NetworkError, PercentAuthorizationError, PercentNetworkError, PercentSuccess, ServerTimeoutError, Success, ThrottlingError, Timestamp, TotalBillableRequests, TotalEgress, TotalIngress, TotalRequests. Application / Storage throttling: when? / Data exchange. [Chart omitted: Jobs, Storage; Sum of TotalIngress, Sum of TotalEgress]
  37. 37. [Chart omitted: Jobs, Storage, Total]
2014-03-26 22:28:37,321 INFO CallbackServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] callback for action [0000000-140326181153083-oozie-hdp-W@pig-node-01]
2014-03-26 22:28:37,472 INFO PigActionExecutor:539 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] action completed, external ID [job_201403261811_0001]
2014-03-26 22:28:37,562 WARN PigActionExecutor:542 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2]
2014-03-26 22:28:38,101 INFO ActionEndXCommand:539 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] ERROR is considered as FAILED for SLA
2014-03-26 22:28:38,228 WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [WorkflowActionGetJPAExecutor] ended with an active transaction, rolling back
2014-03-26 22:28:38,343 INFO ActionStartXCommand:539 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0
High Latency → Timeout
  38. 38. http://blogs.technet.com/b/oliviaklose/ http://alexeikh.wordpress.com/ http://blogs.msdn.com/b/bigdatasupport/
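On the small-files point from slide 7: before ingestion, many small IIS log files can be consolidated into a few larger blobs so that each map task gets a reasonably sized split. A hedged sketch; the local paths, storage account, container and blob names are invented for illustration.

# Sketch: merge one day's small IIS logs into a single file, then upload it as one blob.
$day    = (Get-Date).ToString("yyyy-MM-dd")
$merged = "C:\logs\merged\iis-$day.log"                      # hypothetical local staging path

Get-ChildItem "C:\logs\iis\$day" -Filter *.log |
    ForEach-Object { Get-Content $_.FullName | Add-Content $merged }

# Upload the merged file into the raw-data container (placeholder account/container names).
$context = New-AzureStorageContext -StorageAccountName "rawdatastore" -StorageAccountKey $storageAccountKey
Set-AzureStorageBlobContent -File $merged -Container "rawdata" -Blob "iislogs/$day.log" -Context $context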
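On slides 9-10, which weigh the SQL-like abstraction of MapReduce against the dataflow-scripting alternative (on HDInsight, typically Hive versus Pig): the "simple aggregations" case maps naturally onto a Hive query submitted straight from PowerShell. A minimal sketch; the table and column names are invented, and $clusterName is assumed to identify an existing cluster.

# Sketch: a simple aggregation in Hive against an existing cluster (hypothetical table and columns).
Use-AzureHDInsightCluster $clusterName
Invoke-Hive -Query @"
SELECT customerid, COUNT(*) AS visits
FROM iislogs
GROUP BY customerid;
"@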
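Slides 13-15 name the knobs of Mahout's distributed item-based recommender: the similarity measure, the limit on similar items per item, and the maximum number of preferences per user or item. As a hedged sketch of how those could map onto Mahout 0.9's RecommenderJob when submitted as a MapReduce job from PowerShell; the jar location, input and output paths, and parameter values are assumptions to adapt, not the project's settings.

# Sketch: Mahout item-based recommendations with the parameters named on slides 13-15 (all values are placeholders).
$mahoutJar = "wasb:///apps/mahout/mahout-core-0.9-job.jar"     # assumes the job jar was copied to the default container
$recoArgs  = @(
    "--input",  "wasb:///user/hdp/reco/preferences.csv",       # userID,itemID[,preference] records
    "--output", "wasb:///user/hdp/reco/output",
    "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",       # distributed similarity measure
    "--maxSimilaritiesPerItem", "100",                         # limit of #similar items per item
    "--maxPrefsPerUser", "500"                                 # max #preferences per user
)
$recoJob = New-AzureHDInsightMapReduceJobDefinition -JarFile $mahoutJar `
    -ClassName "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob" -Arguments $recoArgs
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $recoJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600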
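Slides 18-23 walk through Mahout's random forest workflow on the KDD data: describe the columns (slide 18), build the forest with a given selection, tree count and partial mode (slides 19-21), then test it against the forest model and analyse the classification results (slides 22-23). A hedged sketch of the build and test steps via PowerShell MapReduce job definitions; the jar and data paths mirror slide 18 where possible, the test file name, output paths and parameter values are assumptions, and the flags are those of Mahout 0.9's org.apache.mahout.classifier.df tools, so verify them against your Mahout version.

# Sketch: build a forest from the described training data, then test it (Mahout 0.9 partial implementation).
$mahoutJar = "wasb:///apps/mahout/mahout-core-0.9-job.jar"       # assumed wasb copy of the jar used on slide 18

$buildArgs = @(
    "-d",  "wasb:///user/hdp/testdata/KDDTrain+.arff",           # training data (as on slide 18)
    "-ds", "wasb:///user/hdp/testdata/KDDTrain+.info",           # descriptor written by Describe
    "-sl", "5",                                                  # attributes selected per tree node
    "-p",                                                        # partial implementation
    "-t", "100",                                                 # number of trees
    "-o", "wasb:///user/hdp/nsl-forest"                          # forest model output
)
$testArgs = @(
    "-i",  "wasb:///user/hdp/testdata/KDDTest+.arff",            # test data (file name assumed)
    "-ds", "wasb:///user/hdp/testdata/KDDTrain+.info",
    "-m",  "wasb:///user/hdp/nsl-forest",                        # forest model from the build step
    "-a", "-mr",                                                 # analyse results, run as MapReduce
    "-o", "wasb:///user/hdp/predictions"
)

$buildForest = New-AzureHDInsightMapReduceJobDefinition -JarFile $mahoutJar `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.BuildForest" -Arguments $buildArgs
$testForest  = New-AzureHDInsightMapReduceJobDefinition -JarFile $mahoutJar `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.TestForest" -Arguments $testArgs

foreach ($jobDef in @($buildForest, $testForest)) {
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600 | Out-Null
}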
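Worked out for slides 24-25: taking 9,458 and 8,325 as the correctly classified instances and 253 and 4,508 as the misclassified ones, accuracy = (9,458 + 8,325) / (9,458 + 253 + 4,508 + 8,325) = 17,783 / 22,544 ≈ 0.789, i.e. roughly 79% of the classified test instances are predicted correctly.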
