#hortonworks sandbox
Text
Resource Links: HDP Sandbox
Because I keep having the same specific problems
https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html
^download here
https://www.cloudera.com/tutorials/learning-the-ropes-of-the-hdp-sandbox.html
^official beginner's guide
https://maelfabien.github.io/bigdata/HortonWorks/
^unofficial beginner's guide (note to self: this has the ssh thing)
https://stackoverflow.com/questions/29223588/ambari-name-node-startup-fails-when-safe-mode-is-on#29223650
^for when you get stuck in safe mode (see the command sketch after these links)
https://community.cloudera.com/t5/Support-Questions/What-is-the-best-way-to-shut-down-Hortonworks-sandbox/td-p/173993
^ how to shut down safely
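A few commands that go with the links above, kept here so I stop re-googling them. This is a personal cheat sheet rather than an official procedure: the SSH port 2222 is the usual NAT-forwarded port for the sandbox VM, so adjust it for however your VM was imported.

```bash
# Log in to the sandbox from the host machine (2222 is the typical forwarded port).
ssh -p 2222 root@127.0.0.1

# Check whether the NameNode is stuck in safe mode, and force it out
# only once you are sure HDFS is healthy (see the Stack Overflow link above).
sudo -u hdfs hdfs dfsadmin -safemode get
sudo -u hdfs hdfs dfsadmin -safemode leave

# Shut the sandbox down cleanly instead of just closing the VM window
# (see the Cloudera community thread above).
shutdown -h now
```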
Text
Post 20 | HDPCD | Removing Duplicate tuples from a PIG Relation
Hi everyone, welcome to one more tutorial in this HDPCD certification series. As you might notice, I have changed the blog layout a little bit; I hope you like it. Kindly let me know your feedback on this in the COMMENT SECTION.
In the last tutorial, we saw how to perform the SORT OPERATION in Apache Pig. In this tutorial, we are going to remove the duplicate tuples from a Pig relation.
Let us…
View On WordPress
#apache#apache pig#Big Data#big data certification#big data tutorial#certification#certification preparation#certified#hadoop#hadoop certification#hadoop commands#hadoop tutorial#HDPCD#hdpcd big data#hdpcd certification#Hortonworks#hortonworks big data#hortonworks certification#hortonworks data platform certified developer#hortonworks sandbox#pig#pig relation#pig relations#remove#remove duplicate#removing duplicate tuples from pig relation#removing duplicates#removing duplicates from pig relation#tuple#tuples
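The post itself is truncated here, but the operation it refers to is Pig's DISTINCT operator. Below is a minimal sketch of how that looks when run from the sandbox shell; the HDFS path, delimiter, and schema are invented for illustration and are not taken from the original tutorial.

```bash
# Write a tiny Pig script and run it on the sandbox (path and schema are hypothetical).
cat > dedup.pig <<'EOF'
-- Load the raw, possibly duplicated records
raw_data = LOAD '/user/root/drivers.txt'
           USING PigStorage('\t')
           AS (driver_id:int, name:chararray, city:chararray);

-- DISTINCT removes exact duplicate tuples from the relation
unique_data = DISTINCT raw_data;

-- Store the de-duplicated relation back into HDFS
STORE unique_data INTO '/user/root/drivers_dedup' USING PigStorage('\t');
EOF

pig -x mapreduce dedup.pig
```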
Text
Sandboxes and their advantages
Two companies are doing much of the work to develop Hadoop technology: Hortonworks and Cloudera. Both produce new ideas and software around Hadoop to make it easier to use, build many applications on top of it, and provide tools for using and learning Hadoop.
Hortonworks provides the "Hortonworks Sandbox" and Cloudera provides the "Cloudera QuickStart VM". Each is a package in which Hadoop comes pre-configured along with the tools needed for a Hadoop development environment. Below we discuss the benefits of using these tools.
Hortonworks Sandbox
This sandbox packages the Hortonworks Data Platform in an easy-to-use form, driven through a terminal that is straightforward to work with. The sandbox provides:
A virtual machine with Hadoop configuration.
Some basics about Hadoop to get you started.
Different tools from the Hadoop ecosystem that will help you, such as Apache Hive, Apache Pig, Apache HBase, and many more.
The Hortonworks Hadoop Sandbox is delivered as a virtual appliance: a bundled set of operating system, configuration settings, and applications that work together as a unit. The appliance runs in a virtual machine on top of the host operating system. To run the Hortonworks Sandbox you must install one of the supported virtualization environments on the host machine: Oracle VirtualBox, VMware Fusion (Mac), or VMware Player (Windows/Linux). Unlike the Cloudera VM, it does not come with a desktop-style user interface; instead it presents a command prompt, and most operations are carried out from the command line (a sketch of how to reach it follows below).
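As a rough illustration of what working from the command prompt looks like in practice, this is how you typically reach the sandbox shell from the host machine. The forwarded SSH port and the Ambari port shown here are common defaults for VirtualBox/VMware imports and may differ between sandbox versions.

```bash
# SSH into the sandbox VM from the host (2222 is the usual forwarded port).
ssh -p 2222 root@127.0.0.1

# Once inside, ordinary Hadoop commands are available, for example:
hdfs dfs -ls /user

# The sandbox also ships the Ambari management UI; from the host, check
# that it answers on its usual port (8080 is a common default, not guaranteed).
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080
```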
Cloudera Quickstart VM
Cloudera also provides tools for learning and using Hadoop: where Hortonworks has the "Hortonworks Sandbox", Cloudera has the "Cloudera QuickStart VM". It too is a virtual machine packaged with the various tools and software used in Hadoop-related work. The VM itself is free, but the edition bundled with Cloudera Manager comes as a 60-day trial and a paid version. It includes software such as Hadoop, HBase, Spark, and Oozie, and provides some basics to get you started. It is frequently used as a starting point before moving on to multi-node clusters.
A comparison of the Hortonworks Sandbox and the Cloudera QuickStart VM:
The Hortonworks Sandbox is open source and completely free, whereas the Cloudera QuickStart VM is also free but the edition with Cloudera Manager comes as a 60-day trial and a paid version.
Both support MapReduce as well as YARN.
Both distributions have a master-slave architecture.
Hortonworks ships the "Ambari" interface, while Cloudera ships "Hue" as its GUI.
Both run on Linux.
Benefits of using these sandboxes in OSP to offer a unique setup to clients:
Both are open source, so on the cost side this is a big benefit: we get a working Hadoop configuration together with supporting software such as Hive, HBase, Eclipse, Spark, and more, and can provide better service to clients without worrying about setup cost.
As discussed above, the sandboxes bundle the supporting tools of the Hadoop ecosystem, so working in a Hadoop environment is easier because everything is available as one package; this increases productivity.
The sandbox platforms are easy to use: we can add our own data sets and connect them to existing tools and applications.
The sandboxes also include introductory material on Hadoop and the other tools, which is helpful for a new employee getting started on a new setup.
Text
How to integrate Hadoop and Teradata using SQL-H
#ICYDK: I have tried the Hadoop Connector for Teradata, the Teradata Connector for Hadoop, Teradata Studio Express, Aster SQL-H, and many more cumbersome alternatives before finally reaching Hadoop-Teradata integration without purchasing the current version of QueryGrid. Note that without QueryGrid you cannot do cross-platform querying; here we just demonstrate bidirectional data transfer between Teradata and Hadoop. All I needed for Teradata to integrate seamlessly with Hadoop was:
* Hadoop Sandbox 2.1 for VMware (http://hortonworks.com/hdp/downloads)
* Teradata Express 15 for VMware (http://downloads.teradata.com/downloads)
* Teradata Connector for Hadoop (TDCH) (http://downloads.teradata.com/downloads)
* Teradata Studio (http://downloads.teradata.com/downloads)
I didn't need to connect Teradata Aster, because all I needed was querying and data transfer between Hadoop and TD. Here is how it happened:
1. I converted the OVA file from the Hortonworks Sandbox download page into a VMX file so it could run in VMware Server. The conversion command is: ovftool.exe Hortonworks_Sandbox_2.1.ova D:/HDP_2.1_Extracted/HDP_2.1_vmware, where HDP_2.1_vmware is the extracted VMDK file. The extraction took an hour on a fast server.
2. I loaded HDP_2.1_vmware.vmdk into VMware Server by adding a new virtual machine; the VMDK produced the VMX as I specified the VM configuration. I chose NAT for the network connection and enabled the USB driver option. When powering on, the VM warned that the SCSI (USB) device was not working and asked whether it should boot from IDE instead; that is the recommended option, so I chose it. The VM ran, and I could browse the Hortonworks Sandbox at http://sandbox.hortonworks.com:8000 and reach WebHDFS on port 50070. I also changed the password for hue in the user admin section at http://sandbox.hortonworks.com:8000.
3. Next I installed Teradata 15 and Teradata Studio and connected the two. It worked well, and there is plenty of documentation for troubleshooting the connection from TD15 to Teradata Studio. The first time I could not connect, Teradata Administrator reported "Connection Refused"; I restarted the SUSE Linux OS hosting the TD 15 VM and could then connect fine.
4. The last install step was the Teradata Connector for Hadoop (TDCH) RPM inside the Hortonworks Sandbox launched in step 2. I used PuTTY to open a shell on HDP 2.1: I entered the IP assigned to sandbox.hortonworks.com and connected on the default port 22, logging in with the username root and the password hadoop. Under /usr/lib/ there were installations of Java 1.7, Hive, Sqoop, etc.; I only needed to confirm that the Java version was 1.7 or above. Using FileZilla I transferred the TDCH RPM file to /usr/lib, then installed it with: rpm -Uvh teradata-connector-1.3.2-hdp2.1.noarch.rpm. The -v (verbose) flag showed all the details as the RPM installed.
5. I then applied the Oozie configuration specified in the installation instructions on the Teradata Studio download page. The namenode was set to sandbox.hortonworks.com; the WebHDFS hostname and port need not be set, since they default to the namenode and 50070 respectively, which works.
6. Finally, open Teradata Studio and add a new database connection, specifying the Hadoop credentials: WebHDFS host name sandbox.hortonworks.com, WebHDFS port 50070, username hue. When I tested the connection, the ping failed at first and then hung for a long pause, which meant it was still processing; the Java exception read "cannot access oozie service". I closed the root connection through PuTTY (I had first tried connecting with the username root) and later closed the Hue sessions on sandbox.hortonworks.com so the connection would not time out. After that the ping succeeded after a 20-second pause.
7. Once both Teradata and the Hadoop Distributed File System were connected to Teradata Studio, I could transfer data to and from both databases. It is done. https://goo.gl/zroF44 #DataScience #Cloud
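For quick reference, here are the shell-facing parts of steps 1 and 4 pulled into one place. The file names, paths, and version numbers are the ones used in this particular walkthrough and will differ for other sandbox or TDCH releases.

```bash
# Step 1: convert the downloaded OVA appliance into VMware disk files
# (run on the Windows host where ovftool is installed).
ovftool.exe Hortonworks_Sandbox_2.1.ova D:/HDP_2.1_Extracted/HDP_2.1_vmware

# Step 4: inside the sandbox (ssh as root on port 22), verify Java and install TDCH.
java -version                                      # must be 1.7 or above
rpm -Uvh /usr/lib/teradata-connector-1.3.2-hdp2.1.noarch.rpm
rpm -qa | grep teradata-connector                  # confirm the package registered
```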
Link
Apache SQOOP for Big Data Hadoop Beginners. What you'll learn: by taking this course you will gain the essential knowledge of Apache Sqoop and Big Data, and install and configure the Hortonworks Data Platform (HDP) Sandbox on a Windows machine. Description: Apache Sqoop is designed to import data from relational databases such as Oracle, MySQL, and so on into Hadoop systems. Hadoop is ideal for batch processing of huge amounts of data and is the industry standard these days. In real-world scenarios, utilizin https://www.couponudemy.com/blog/apache-sqoop-for-big-data-hadoop-beginners/
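For context, this is roughly what a basic Sqoop import of the kind such a course covers looks like. It is a generic sketch rather than material from the course: the JDBC URL, database, table, and credentials below are placeholders.

```bash
# Import one MySQL table into HDFS (all connection details are placeholders).
sqoop import \
  --connect jdbc:mysql://db.example.com:3306/sales \
  --username sqoop_user \
  -P \
  --table orders \
  --target-dir /user/root/orders \
  --num-mappers 1
```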
Text
JavaScript Charts: Comparing D3 to Kendo UI for Data Visualization
#kendo [DZone] Hortonworks Sandbox for HDP and HDF is your chance to get started learning, developing, testing, and trying out new features. Each download comes preconfigured with interactive tutorials, sample ...
Text
New Post has been published on Details of the treatment of certain diseases. Human Diseases and methods of treatment
New Post has been published on http://bit.ly/2ABgCsU
Data Lake Tools for Visual Studio with Hortonworks Sandbox - Azure HDInsight | Microsoft Docs
Text
Big Data Startups - Datamation
Big Data Startups (Datamation): Big Data startups are emerging quickly, because Big Data itself is quickly moving from emerging technology to mature technology. Companies that were startups five years ago are now key players, like Cloudera and Hortonworks. Those companies, which ... Cask Lowers Barrier to Big Data Adoption Further with Launch of CDAP Cloud Sandbox for AWS (Marketwired press release)
http://ift.tt/2tug2JN via http://ift.tt/18iWci2
Text
Sticky
Hello, random tumblr user and/or fellow aspiring data scientist. I'm a former prep cook / pastry cook / technical writer learning big data programming with Scala. Since I'm not really sure how to blog about that yet, this blog will soon be filled with links to resources for what I'm learning. I may also occasionally reblog something Animal Crossing related, and eventually I may figure out how to blog about what I'm learning.
I’m going to keep an index here so it’s easy to find my resource posts, since tumblr can be difficult to navigate. I’ll add links to resource posts as I make them.
Hortonworks Data Platform (HDP) Sandbox: post link
Spark SQL resource links here: post link
Agile / Scrum resource links here: post link
Planned future posts:
Big data
Scala
Python
SQL vs NoSQL
MongoDB
Hadoop
Hive
Spark
Kafka
Text
Post 19 | HDPCD | Sort the output of a Pig Relation
Hi everyone, thanks for coming back again to continue with this tutorial series. We are almost there with this section, and once we are done with this, we will jump into Hive, which will not take much time.
In the last tutorial, we saw the process of storing data from Pig to Hive using HCatalog. That was a sizeable tutorial, and based on the feedback I got, this one is going to be smaller compared to…
View On WordPress
#apache#apache pig#Big Data#big data certification#big data tutorial#certification#certification preparation#certified#hadoop#hadoop certification#hadoop commands#hadoop tutorial#HDPCD#hdpcd big data#hdpcd certification#Hortonworks#hortonworks big data#hortonworks certification#hortonworks data platform certified developer#hortonworks sandbox#sort#sort in pig#sort operation#sort output in pig#sort output of pig relation#sort the output#sort the output of a pig relation#sorting#sorting in apache pig#sorting in pig
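The teaser stops before the actual statement, but sorting a Pig relation comes down to ORDER ... BY. A small sketch follows, run the same way as the other Pig examples on the sandbox; the input path, schema, and sort key are illustrative assumptions, not from the tutorial itself.

```bash
cat > sort_output.pig <<'EOF'
-- Load the relation to be sorted (hypothetical path and schema)
raw_data = LOAD '/user/root/drivers.txt'
           USING PigStorage('\t')
           AS (driver_id:int, name:chararray, city:chararray);

-- ORDER ... BY sorts the relation; DESC gives descending order
sorted_data = ORDER raw_data BY driver_id DESC;

-- Write the sorted output back to HDFS
STORE sorted_data INTO '/user/root/drivers_sorted' USING PigStorage('\t');
EOF

pig -x mapreduce sort_output.pig
```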
Text
Running php on Hortonworks sandbox
I am trying to design a front end for my back-end services, which involve Hadoop and Hive. I was able to get a XAMPP server running on port 8085 of the Hortonworks 2.4 sandbox; for that I had to stop the httpd service that was already running. I was also successful in writing PHP code that talked to the MySQL services. However, something astonishing that I noticed right away is that on firing hadoop or…
View On WordPress
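For anyone hitting the same port clash, here is a rough sketch of the kind of setup described above. The service commands are the CentOS 6 style used by HDP 2.4-era sandboxes, and the XAMPP paths are its Linux defaults; treat the exact port and paths as assumptions to adapt.

```bash
# Stop the Apache httpd that the sandbox already runs, and keep it from
# restarting on reboot, so its port is free for XAMPP.
service httpd stop
chkconfig httpd off

# Point XAMPP's bundled Apache at a port that doesn't clash with sandbox
# services (8085 here), then start the stack.
sed -i 's/^Listen 80$/Listen 8085/' /opt/lampp/etc/httpd.conf
/opt/lampp/lampp start
```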
Text
Experian moves from the mainframe to Hadoop to process credit files
Experian CIO Barry Libenson relies on Hadoop to process his customers' credit files faster. (Credit: Experian)
Thanks to the open source software Hadoop, together with microservices and APIs, Experian can quickly process massive amounts of data and deliver financial analyses faster to its banking and insurance customers.
Experian has put in place a data analytics system that cuts the time it takes to process the petabytes of data coming from hundreds of millions of customers around the world from several days to a few hours. The Ireland-based information services company deployed a data fabric layer built on the Hadoop file processing system, alongside microservices and an API platform. With these systems, businesses and consumers can get to financial reports and credit information more quickly. "It is a real game-changer for our customers, because it gives them real-time access to information that normally takes some time to analyze and is usually not available immediately," explains Experian's CIO, Barry Libenson. Hadoop, an open source tool designed to drive large big data projects, has become a fixture of many analytics strategies as CIOs look to put information products and services in the hands of customers. The technology uses parallel processing techniques that, with the right software, can process larger amounts of data faster than SQL-based data management tools.
Hadoop speeds up data processing
When Barry Libenson took over as Experian's CIO in 2015, the company was still using mainframe systems to process requests, and the data to be managed was growing at an exponential rate. At the time, engineers had to ingest and process data files as they arrived, then standardize and clean the information before passing it on to the client company. To meet the new data management demands they added more processors. Meanwhile, on Amazon.com, customers could order shoes or computing power in a few mouse clicks. Libenson knew Experian needed a more fluid data management strategy, one capable of delivering data analysis in real time.
Like other companies, Experian was trying out new data processing tools, experimenting with Hadoop variants such as Cloudera, Hortonworks, and MapR in sandboxes on premises or on Amazon Web Services (AWS). But the CIO knew that if Experian wanted to pull relevant data out of its sources and deliver new products to its millions of customers, it needed a platform that would let it standardize its process. After some testing, Libenson settled on Cloudera. The multi-tenant system runs on premises in Experian's hybrid cloud, although the company can add computing capacity on AWS if needed. A Colombian credit institution was one of the first customers to benefit from Experian's Hadoop data fabric: thanks to Hadoop's real-time processing capabilities, Experian was able to process 1,000 financial statements in less than six hours, compared with six months previously on its mainframe system, which could not normalize and clean the data of more than one statement at a time. "Customers know they will get data in near real time and do not risk being handed stale data," the CIO added.
Microservices and API calls
With results like these, one might ask why more companies have not yet adopted Hadoop. The platform holds a modest but growing share of the market for big data and enterprise analytics technology, a market that IDC expects to be worth 187 billion dollars in 2019. In practice, the software can be complicated to put into production, in particular because it is hard to find engineers who know the technology well. Parallel processing and the handling of unstructured information follow a different logic for manipulating data and call for specific skills. "The way you write and think about applications is totally different. You have to think in terms of nodes and accept that a failure can happen at any node," explains Barry Libenson. "Most software developers who work in SQL do not think that way." According to Experian's CIO, "it is hard to find people who know how to work in this architecture." Unlike seasoned database engineers steeped in the SQL world, fresh graduates, statisticians, and data specialists have been trained on Hadoop. But given the fierce war being waged to recruit that talent, Libenson often has young graduates and data specialists work alongside SQL engineers to get better results out of Hadoop.
Following Experian's migration to Hadoop, the company's engineers can remove the bottlenecks that appear during data preparation and enrich the company's products with more information. Banks, financial services firms, and other businesses can also access reports and other products through Experian's new API platform and its microservices architecture, which is decoupled and less dependent on application functionality. For example, a financial services company that wants to know a customer's creditworthiness or check the payment history on a credit card can make an API call through Experian to retrieve the data, instead of downloading and going through applications to access the full set of information. "Today, the demand for microservices to access information is much stronger than the demand for traditional on-premises applications," Libenson said. "All financial institutions are moving to a microservices model, and the API approach fits very well with how they want to consume information."
Moving to a DevOps mode
Experian's move to more modern, modular architectures (Hadoop, microservices, and APIs) also required an overhaul of software development. Projects used to be exhaustively documented and built up in stages over months, with features added gradually; Libenson says his IT department has since adopted agile and DevOps methodologies to build minimum viable products, test them, and refine them as needed. The shift to a hybrid cloud model, a microservices architecture, and an API platform represents "a big change." "This evolution will allow Experian to reduce errors, lower costs, and accelerate innovation," Libenson said.
Go to Source
"Experian moves from the mainframe to Hadoop to process credit files" was originally published on JDCHASTA SAS
Text
Hortonworks Sandbox
Hortonworks Sandbox is a personal, portable Apache Hadoop environment that comes with dozens of interactive tutorials for Hadoop and its ecosystem, along with the most exciting developments from the latest HDP
New Post has been published on http://www.predictiveanalyticstoday.com/hortonworks-sandbox/
Text
Post 13 | HDPCD | Data Transformation using Apache Pig
In the previous tutorial, we saw how to load data from Apache Hive into Apache Pig; if you remember, we used HCatalog to perform that operation. In this tutorial, we are going to look at the process of doing data transformation using Apache Pig. Data transformation itself is quite involved and can differ from one requirement to another.
For the certification, in simple words, I…
View On WordPress
#apache#apache pig#Big Data#big data certification#big data tutorial#certification#certification preparation#certified#Data Transformation#data transformation using apache pig#data transformation using pig#hadoop certification#HDPCD#hdpcd big data#hdpcd certification#Hortonworks#hortonworks big data#hortonworks certification#hortonworks data platform certified developer#hortonworks sandbox#pig#pig latin#pig transformation
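The post is cut off before the details, but a typical Pig transformation combines FILTER with FOREACH ... GENERATE. A minimal sketch under assumed inputs follows; the file, delimiter, columns, and derived field are invented for illustration and are not taken from the certification material.

```bash
cat > transform.pig <<'EOF'
-- Load the raw records (hypothetical path, delimiter, and schema)
raw_data = LOAD '/user/root/trucks.csv'
           USING PigStorage(',')
           AS (truck_id:chararray, model:chararray, miles:int, gas:int);

-- FILTER keeps only the tuples we care about
long_haul = FILTER raw_data BY miles > 10000 AND gas > 0;

-- FOREACH ... GENERATE projects columns and derives new ones
mpg_data = FOREACH long_haul GENERATE truck_id, model, (double)miles / gas AS mpg;

-- Store the transformed relation
STORE mpg_data INTO '/user/root/truck_mpg' USING PigStorage(',');
EOF

pig -x mapreduce transform.pig
```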