Quickstart Guide
Let’s get started with Treasure Data! Treasure Data is a Hadoop-based Big Data as-a-Service platform. Forget about the servers, storage, and infrastructure needed to store your billions of records and focus on analyzing your data instead.
Step 1: Sign-Up
Sign up for Treasure Data if you haven’t yet.
Step 2: Install the Treasure Data Toolbelt
Install the Treasure Data Toolbelt for your development environment. It contains the td command, a CLI tool for importing, managing, and querying your data.
- MacOS Installer
- Windows Installer
- Linux (Redhat/CentOS, Debian/Ubuntu)
- or ‘gem install td’ for those who are familiar with Ruby. td works on both Ruby 1.8 and 1.9.
Step 3: Authorize
Once you have installed the toolbelt, the td command is available from your command line. Authorize your account with the td account command. When prompted, enter the email address and password you used when signing up.
    $ td account -f
    Enter your Treasure Data credentials.
    Email: k@treasure-data.com
    Password (typing will be hidden):
    Authenticated successfully.
Step 4: Import Sample Dataset
Now you’re ready! Let’s get our feet wet by importing a sample Apache log. You will first need to create a database and a table on the cloud via the CLI.
    $ td db:create testdb
    $ td table:create testdb www_access
    Table 'testdb.www_access' is created.
Let’s generate a sample Apache log and import it to the cloud. td sample:apache generates 5,000 lines of Apache log data in JSON format. td table:import takes a JSON file and uploads it to the cloud.
    $ td sample:apache apache.json
    $ td table:import testdb www_access --json apache.json
    $ tail -n 1 apache.json
    {"host":"200.129.205.208","user":"-","method":"GET","path":"/category/electronics","code":200,"referer":"-","size":62,"agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11","time":1334035906}
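Before uploading your own data the same way, it can help to check locally that the file has the shape shown above. The following is a minimal sketch, not part of the td toolbelt: it assumes the file holds one JSON object per line and that `time` is a Unix timestamp in seconds, as the sample record suggests. The helper names `parse_record` and `check_file` are hypothetical.

```python
import json
from datetime import datetime, timezone

# Last line of apache.json, copied from the sample output in this guide.
SAMPLE_LINE = (
    '{"host":"200.129.205.208","user":"-","method":"GET",'
    '"path":"/category/electronics","code":200,"referer":"-","size":62,'
    '"agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 '
    '(KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",'
    '"time":1334035906}'
)


def parse_record(line):
    """Parse one log line and return (record, event time in UTC)."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("expected one JSON object per line")
    # Assumption: `time` is a Unix timestamp in seconds.
    event_time = datetime.fromtimestamp(record["time"], tz=timezone.utc)
    return record, event_time


def check_file(path):
    """Return the number of records in a file; raise on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # ignore blank lines
                parse_record(line)
                count += 1
    return count


record, event_time = parse_record(SAMPLE_LINE)
print(record["code"], event_time.isoformat())
```

Calling `check_file("apache.json")` on the generated sample should return the number of records in the file if every line parses.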
The records will be imported into our database after about 20-60 seconds.
    $ td tables
    +----------+------------+------+-------+--------+
    | Database | Table      | Type | Count | Schema |
    +----------+------------+------+-------+--------+
    | testdb   | www_access | log  | 5000  |        |
    +----------+------------+------+-------+--------+
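If you want to check the Count column from a script rather than by eye, the table text can be parsed. This is a rough sketch that assumes the output layout is exactly the one shown above; `parse_ascii_table` and `row_count` are hypothetical helpers, not part of td.

```python
# Output of `td tables`, copied from this guide.
TD_TABLES_OUTPUT = """\
+----------+------------+------+-------+--------+
| Database | Table      | Type | Count | Schema |
+----------+------------+------+-------+--------+
| testdb   | www_access | log  | 5000  |        |
+----------+------------+------+-------+--------+
"""


def parse_ascii_table(text):
    """Turn a +---+ bordered table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only data lines; border lines start with "+".
        if not (line.startswith("|") and line.endswith("|")):
            continue
        # Drop the outer pipes, then split the remaining cells.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


def row_count(rows, database, table):
    """Return the Count for one table as an int, or None if it is absent."""
    for row in rows:
        if row["Database"] == database and row["Table"] == table:
            return int(row["Count"])
    return None


rows = parse_ascii_table(TD_TABLES_OUTPUT)
print(row_count(rows, "testdb", "www_access"))  # prints 5000
```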
Step 5: Issue Queries
Finally, let’s issue a SQL query. Our Hadoop-based engine on the cloud executes your query and returns the result to you. The following query calculates the distribution of HTTP status codes.
    $ td query -w -d testdb \
      "SELECT v['code'] AS code, COUNT(1) AS cnt FROM www_access GROUP BY v['code']"
    queued...
    started at 2012-04-10T23:44:41Z
    2012-04-10 23:43:12,692 Stage-1 map = 0%, reduce = 0%
    2012-04-10 23:43:18,766 Stage-1 map = 100%, reduce = 0%
    2012-04-10 23:43:29,925 Stage-1 map = 100%, reduce = 33%
    2012-04-10 23:43:32,973 Stage-1 map = 100%, reduce = 100%
    Status : success
    Result :
    +------+------+
    | code | cnt  |
    +------+------+
    | 404  | 17   |
    | 500  | 2    |
    | 200  | 4981 |
    +------+------+
The command above will take about 15-45 seconds, owing mainly to the overhead in setting up jobs within the cloud-based Hadoop engine.
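To see what the query computes, the same grouping can be reproduced locally on the sample file. This is only an illustrative sketch that assumes one JSON object per line with a `code` field, as in the sample record; it is not how the Hadoop-based engine runs the query. The function name `status_distribution` is hypothetical.

```python
import json
from collections import Counter


def status_distribution(lines):
    """Count records per HTTP status code, like GROUP BY v['code']."""
    counts = Counter()
    for line in lines:
        if line.strip():  # ignore blank lines
            counts[json.loads(line)["code"]] += 1
    return counts


# Tiny hypothetical input standing in for apache.json.
sample = ['{"code":200}', '{"code":404}', '{"code":200}']
for code, cnt in status_distribution(sample).items():
    print(code, cnt)

# With the real file:
#     with open("apache.json", encoding="utf-8") as f:
#         print(status_distribution(f))
```

In the result shown above, the three counts add up to 5,000, the number of records in the table.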
What’s Next?
You’re now ready to import your real data to the cloud! The following tutorials will explain how to import your data (e.g. Application Logs, Middleware Logs) from various sources. For a deeper understanding of the platform, please refer to the architecture overview article.
Languages and Frameworks
| Supported Languages | | |
|---|---|---|
| Ruby or Rails | Java | Perl |
| Python | PHP | Scala |
| Node.js | | |
Middleware
If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels. Live chat with our staff also works well.